Abstract

The Dirichlet Process (DP) is a Bayesian non-parametric prior for infinite mixture modeling,
where the number of mixture components grows with the number of data items. The 
Hierarchical Dirichlet Process (HDP), often used for non-parametric topic modeling, is an 
extension of DP for grouped data, where each group is a mixture over shared mixture 
densities. The Nested Dirichlet Process (nDP), on the other hand, is an extension of the 
DP for learning group level distributions from data, simultaneously clustering the groups. 
It allows group level distributions to be shared across groups in a non-parametric setting, 
leading to a non-parametric mixture of mixtures. The nCRP extends the nDP for multi-level
non-parametric mixture modeling, enabling modeling of topic hierarchies. However, the
nDP and nCRP do not allow sharing of distributions as required in many applications,
motivating the need for multi-level non-parametric admixture modeling. We address this
gap by proposing multi-level nested HDPs (nHDP), where the base distribution of the HDP
is itself an HDP at each level, thereby leading to admixtures of admixtures at each level.

We motivate the need for nHDP by applying a two-level version of it for non-parametric 
entity topic modeling, where an inner HDP creates a countably infinite set of topic mixtures 
and associates them with entities, while an outer HDP associates documents with these 
entities or topic mixtures. Making use of a multi-level nested Chinese Restaurant Franchise 
(nCRF) representation for the nested HDP, we propose a collapsed Gibbs sampling based 
inference algorithm for the model. Because of couplings between various HDP levels, scaling 
up is naturally a challenge for the inference algorithm. We propose a scalable inference 
algorithm by extending the direct sampling scheme of the HDP to multiple levels. In our 
experiments for non-parametric entity topic modeling on two real world research corpora, 
we show that, even when large fractions of author entities are hidden, the nHDP is able 
to generalize significantly better than existing models. More importantly, using nHDP, we 
are able to detect missing authors at a reasonable level of accuracy. 

*. The first two authors have contributed equally to the paper 
1. Introduction


Dirichlet Process mixture models [Antoniak (1974)] allow for non-parametric or infinite
mixture modeling, where the number of densities or mixture components is not fixed ahead
of time, but is allowed to grow (slowly) with the number of data items. This is achieved by
using as a prior the Dirichlet Process (DP), which is a distribution over distributions, and
has the additional property that draws from it are discrete (w.p. 1) with infinite support
[Antoniak (1974); Ferguson (1973)]. The popular LDA model [D. Blei and Jordan (2003)]
may be considered as a parametric restriction of the HDP mixture model. LDA and its
non-parametric counterpart HDP have since been used extensively as priors for modeling text
collections [Blunsom et al. (2009); Sharif-razavian and Zollmann (2008)]. However, many
applications require joint analysis of groups of data, such as a collection of text documents,
where the mixture components, or topics (as they are called for text data), are shared across
the documents. This calls for a coupling of multiple DPs, one for each document, where
the base distribution is discrete, and shared. The hierarchical Dirichlet Process (HDP)
[Y. Teh and Blei (2006)] does so by placing a DP prior on a shared base distribution, so
that the model now has two levels of DPs.


The HDP mixture model belongs to the family of non-parametric admixture models
[E. Erosheva and Lafferty (2004)], where each composite data item or group gets assigned
to a mixture over the mixture components or topics, enabling group specific mixtures to
share mixture components. Hence the HDP family leads to group level distributions with
shared mixture component distributions, leading to a family of distributions over
distributions. While this adds more flexibility to the groups of data items, the ability to cluster
groups themselves is lost, since each group now has a distinct mixture of topics associated
with it. This additional capability is desired in many applications. For instance, consider
the analysis of patient profiles in hospitals [A. Rodriguez and Gelfand (2008)], where we
would like to cluster patients in each hospital and additionally cluster the hospitals with
common distributions over patient profiles. This is achieved by the nested Dirichlet Process
(nDP), which constructs a DP mixture over possible group level distributions from which the
distribution for each hospital is drawn,
thus clustering hospitals based on the specific group level distribution chosen. This DP
mixture has a base distribution that is itself a DP (instead of a draw from a DP, like
in the case of HDP), from which the group level distributions over patient profiles are
drawn. Since the patient profiles are themselves appropriately chosen distributions, the
nDP results in a distribution over distributions over distributions, unlike the HDP and the
DP, which are distributions over distributions. The nDP model therefore becomes a prior
for non-parametrically modeling mixtures of mixtures over appropriately chosen component
distributions. The nested CRP (nCRP) [D. Blei and Tanenbaum (2010)], a closely related
model, proposes a model for multi-level hierarchical mixture modeling to discover topic
hierarchies of arbitrary depth through the predictive distribution obtained by integrating
out the DP in a multi-level nDP.


While the nDP family enables multi-level non-parametric mixture modeling, it is
limited by the fact that it does not allow sharing of mixture components across group specific
distributions at each level. For instance, in the previous example, group level
distributions in hospitals do not share mixture components (patient profiles). In several real world
applications, a need arises for multi-level non-parametric mixture modeling where at each
level, group specific mixtures are required to share mixture components. This necessitates
multi-level non-parametric admixture modeling. For instance, imagine a corpus containing
descriptions related to entities, such as a shared set of researchers who have authored a
large body of scientific literature, or a shared set of personalities discussed across news
articles, such that each entity can be represented as a mixture of topics. Here, topic mixtures,
corresponding to entities, are required to be shared across data groups or documents. In
addition, we would like topics themselves to be shared across the topic mixtures corresponding
to entities.

One could attempt to model this problem of non-parametric entity-topic modeling with 
nDP. The nDP can be imagined as first creating a discrete set of mixtures over topics, 
each mixture representing an entity, and then choosing exactly one of these entities for 
each document. In this sense, the nDP is a mixture of admixtures. However, a major 
shortcoming of the nDP for entity analysis is the restrictive assumption of a single entity 
being associated with a document. In research papers, multiple authors are associated with 
any document, and any news article typically discusses multiple news personalities. This 
requires each document to have a distribution over entities. In other words, we need a 
model that is an admixture of admixtures, motivating the need for multi-level admixture
modeling. 

In this paper, we address non-parametric multi-level admixture models. To the best of 
our knowledge, there is no prior work that addresses this problem. We propose the nested 
HDP (nHDP), comprising multiple levels of HDP, where the base distribution of each
HDP is itself an HDP. For inference using the nHDP, we propose the nested CRF (nCRF),
which extends the Chinese Restaurant Franchise (CRF) analogy of the HDP to multiple
levels by integrating out each HDP. However, due to strong coupling between the CRF
layers, inference using the nCRF poses computational challenges. We propose a scalable
algorithm for inference in the multi-level setting with a direct sampling scheme, based on
that for the HDP, where the mixture component associated with an observation is directly
sampled at each level, based on the counts of table assignments and stick-breaking weights
at each of the levels. 

We apply the two-level nHDP to address the problem of non-parametric entity topic 
analysis for simultaneous discovery of entities and topics from document collections. The 
two-level nHDP belongs to the same class of models as a two-level nDP, in the sense that 
it specifies a distribution over distributions (entities) over distributions (topics). However, 
unlike the nDP, it first creates a discrete set of entities, and models each group as a document
specific mixture over these entities using an HDP. Similarly, it creates a discrete set of topics
and models each entity as a distribution over these topics using another level of HDP, leading
to two levels of HDPs. Apart from addressing the novel problem of multi-level admixture
modeling, to the best of our knowledge, ours is the first attempt at entity topic modeling 
that is non-parametric in both entities and topics. The Author Topic Model falls out as a 
parametric version of this model, when the entity set is observed for each document, and the 
number of topics is fixed. Using experiments over publication datasets using author entities 
from NIPS and DBLP, we show that the nHDP generalizes better under different levels 
of available author information. More interestingly, the model is able to detect authors 
completely hidden in the entire corpus with reasonable accuracy. 
2. Related Work


In this section, we review existing literature on Bayesian nonparametric modeling and entity- 
topic analysis. 

Bayesian Nonparametric Models: We will review the Dirichlet Process (DP) [Ferguson
(1973)], the Hierarchical Dirichlet Process (HDP) [Y. Teh and Blei (2006)] and the nested
Dirichlet Process (nDP) [A. Rodriguez and Gelfand (2008)] in detail in Sec. 3.

The MLC-HDP [P. Wulsin and Litt (2012)] is a 3-layer model proposed for human
brain seizure data. The 2-level truncation of the model is closely related to the HDP
and the nDP. Like the HDP, it shares mixture components across groups (documents) and
assigns individual data points to the same set of mixtures, and like the nDP it clusters
each of the groups or documents using a higher level mixture. In other words, this is a
nonparametric mixture of admixtures, while our proposed nested HDP is a nonparametric
admixture of admixtures.

The nested Chinese Restaurant Process (nCRP) [D. Blei and Tanenbaum (2010)]
extends the Chinese Restaurant Process analogy of the Dirichlet Process to an
infinitely-branched tree structure over restaurants to define a distribution over finite length paths of
trees. This can be used as a prior to learn hierarchical topics from documents, where each
topic corresponds to a node of this tree, and each document is generated by a random path
over these topics. The nCRP is also closely connected to the nDP in that the predictive
distribution obtained by integrating out the DPs at each level from a K-level nDP leads to an
nCRP. However, while the nCRP and the nDP facilitate multi-level non-parametric mixture
modeling, they are not suitable for modeling multi-level non-parametric admixtures.

An extension to the nCRP model, also called the nested HDP, has recently been
proposed on arXiv [J. Paisley and Jordan (2012)]. In the spirit of the HDP, which has a top
level DP providing base distributions for document specific DPs, this model has a top
level nCRP, which becomes the base distribution for document specific nCRPs. In
contrast, our model for multi-level non-parametric admixtures has nested HDPs, in the sense
that one HDP directly serves as the base distribution for another HDP, like in the nested
DP [A. Rodriguez and Gelfand (2008)], where one DP serves as the base distribution for
another DP. This parallel with the nested DP motivates the nomenclature of our model as
the nested HDP.

Next, we briefly review prior work on entity-topic modeling, which involves simultaneously
modeling entities and topics in documents, an application we use throughout the paper to
motivate our model. The literature mostly contains parametric models, where the numbers
of topics and entities are known ahead of time. The LDA model [D. Blei and Jordan
(2003)] is the most popular parametric topic model; it infers a known number of latent
topics from document collections. LDA models each document as a distribution over a
finite set of topics and each topic as a distribution over words. The author-topic model (ATM)
[M. Rosen-Zvi and Smyth (2004)] extends LDA to capture the known authors of each
document by modeling a document as a uniform distribution over a known author set and authors
as distributions over topics, which are themselves distributions over words. Hence, the ATM
can be used for parametric entity-topic modeling where the authors correspond to entities
in documents. The Author Recipient Topic model [A. McCallum and Wang (2004)]
distinguishes between sender and recipient entities and learns the topics and topic distributions
of sender-recipient pairs. In [D. Newman and Smyth (2006)], the authors analyze
entity-topic relationships from textual data containing entity words and topic words, which are
pre-annotated. The Entity Topic Model [H. Kim and Han (2012)] proposes a generative
model which is parametric in both entities and topics and assumes observed entities for
each document.

There has been very little work on nonparametric entity-topic modeling, which would
enable discovery of entities in settings where entities are partially or completely unobserved
in documents. The Author Disambiguation Model [Dai and Storkey (2009)] is a
non-parametric model for the author entities along with topics. Primarily focusing on author
disambiguation from noisy mentions of author names in documents, this model treats
entities and topics symmetrically, generating entity-topic pairs from a DP prior. Contrary
to this approach, our model is capable of treating the entity as a distribution over topics,
thus explicitly modeling the fact that authors of documents have preferences over specific
topics. We present experiments later in the paper to demonstrate the effectiveness of our model
for non-parametric entity topic analysis.


3. Background 

Consider a setting where observations are organized in groups. Let $x_{ji}$ denote the i-th
observation in the j-th group. For a corpus of documents, $x_{ji}$ is the i-th word occurrence in the
j-th document. In the context of this paper, we will use group synonymously with document,
and data item with word in a document. We assume that each $x_{ji}$ is independently drawn from a
mixture model and has a mixture component parameterized by a factor, say $\theta_{ji}$, representing
a topic, associated with it. We let these factors themselves be drawn independently from a
distribution $H$. For each group $j$, let the associated factors $\theta_j = \{\theta_{j1}, \theta_{j2}, \ldots\}$ have a prior
distribution $G_j$. Finally, let $F(\theta_{ji})$ denote the distribution of $x_{ji}$ given factor $\theta_{ji}$. Therefore,
the generative model is given by

$$\theta_{ji} \mid G_j \sim G_j; \quad x_{ji} \mid \theta_{ji} \sim F(\theta_{ji}), \quad \forall j, i \qquad (1)$$


The central question in analyzing a corpus of documents is the parametrization of the
$G_j$ distributions: what parameters to share and what priors to place on them. The
LDA model [D. Blei and Jordan (2003)] is the most popular parametric topic model; it
assumes $G_j \sim \mathrm{Dir}(\alpha/K)$ is a distribution over a finite number $K$ of topics for each document.
The choice of Dirichlet prior is based on the conjugacy of the Dirichlet distribution with
the multinomial, which leads to efficient inference. However, in most realistic scenarios, the
number of topics $K$ is not known in advance.

Bayesian non-parametric modeling is a paradigm that enables us to choose a prior
for $G_j$ that allows for a countably infinite number of mixture components. This enables
working with mixture models without having to fix the number of mixture components in
advance, by working with $G_j$ of the form $G_j = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$ with atoms $\phi_k \sim H$, a base
distribution. We start with such a prior, the Dirichlet Process, which considers each of the
$G_j$ distributions in isolation, then the Hierarchical Dirichlet Process, which ensures sharing
of atoms among the different $G_j$s, and finally the nested Dirichlet Process, which additionally
clusters the groups by ensuring that all the $G_j$s are not distinct.
Dirichlet Process: We start with a formal definition of the Dirichlet process as a prior for
the $G_j$ distribution. Let $(\Theta, \mathcal{B})$ be a measurable space. A Dirichlet Process (DP) [Ferguson
(1973); Antoniak (1974)] is a measure over measures $G_j$ on that space. Let $H$ be a finite
measure on the space. Let $\alpha$ be a positive real number. We say that $G_j$ is DP distributed
with concentration parameter $\alpha$ and base distribution $H$, written $G_j \sim DP(\alpha, H)$, if for any
finite measurable partition $(A_1, \ldots, A_r)$ of $\Theta$, we have

$$(G_j(A_1), \ldots, G_j(A_r)) \sim \mathrm{Dir}(\alpha H(A_1), \ldots, \alpha H(A_r)). \qquad (2)$$


The stick-breaking representation provides a constructive definition for samples drawn
from a DP, by explicitly drawing the mixture weights for $G_j$. It can be shown [Sethuraman
(1994)] that a draw $G_j$ from $DP(\alpha, H)$ can be written as

$$\phi_k \sim H, \; k = 1 \ldots \infty; \quad w_i \sim \mathrm{Beta}(1, \alpha); \quad \beta_i = w_i \prod_{j=1}^{i-1} (1 - w_j); \quad G_j = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k} \qquad (3)$$

where the atoms $\phi_k$ are drawn independently from $H$ and the corresponding weights $\{\beta_k\}$
follow a stick breaking construction. This is also called the GEM distribution: $(\beta_k)_{k=1}^{\infty} \sim
GEM(\alpha)$. The stick breaking construction shows that draws from the DP are necessarily
discrete, with infinite support, and the DP therefore is suitable as a prior distribution on
mixture components for ‘infinite’ mixture models. Subsequently, $\{\theta_{ji}\}$ are drawn from $G_j$,
followed by draws $\{x_{ji}\}$ (similar to Eqn. 1). The generation of $G_j$ from the DP prior
followed by the generation of $\{\theta_{ji}\}$ and $\{x_{ji}\}$ constitutes the Dirichlet Process mixture
model [Ferguson (1973)].
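As an illustration, the stick-breaking construction of Eqn. 3 can be sketched in Python under a finite truncation. The truncation level, the standard Gaussian used as the base distribution $H$, and the function names are assumptions made only for this sketch; an exact draw from the DP has infinitely many atoms.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Return the first `truncation` stick-breaking weights and the unassigned mass."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        w = rng.betavariate(1.0, alpha)  # w_i ~ Beta(1, alpha)
        weights.append(w * remaining)    # beta_i = w_i * prod_{j<i} (1 - w_j)
        remaining *= 1.0 - w
    return weights, remaining


def sample_truncated_dp(alpha, base_sampler, truncation, rng):
    """Approximate G ~ DP(alpha, H) by its first `truncation` atoms and weights."""
    weights, leftover = stick_breaking_weights(alpha, truncation, rng)
    atoms = [base_sampler(rng) for _ in range(truncation)]  # phi_k ~ H
    return atoms, weights, leftover


rng = random.Random(0)
# Assumed base distribution H: a standard Gaussian over real-valued atoms.
atoms, weights, leftover = sample_truncated_dp(
    2.0, lambda r: r.gauss(0.0, 1.0), 50, rng)
print(f"mass on first 50 atoms: {sum(weights):.4f}, left over: {leftover:.4f}")
```

The leftover mass indicates how much of the infinite sum is ignored by the truncation; it shrinks quickly as the truncation level grows relative to $\alpha$.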


Another commonly used perspective of the DP is the Chinese Restaurant Process (CRP)
[Pitman (2002)], which shows that the DP tends to cluster the draws $\theta_{ji}$ from $G_j$. Let $\{\theta_{ji}\}$ denote
the sequence of draws from $G_j$, and let $\{\phi_k\}$ be the atoms of $G_j$. The CRP considers the
predictive distribution of the i-th draw $\theta_{ji}$ given the first $i-1$ draws $\theta_{j1}, \ldots, \theta_{j,i-1}$ after
integrating out $G_j$:

$$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha, H \sim \sum_{k=1}^{K} \frac{n_{jk}}{i-1+\alpha}\, \delta_{\phi_k} + \frac{\alpha}{i-1+\alpha}\, H \qquad (4)$$

where $n_{jk} = \sum_{i'=1}^{i-1} \delta(\theta_{ji'}, \phi_k)$. The above conditional may be understood in terms of the
following restaurant analogy. Consider an initially empty ‘restaurant’ with index $j$ that can
accommodate an infinite number of ‘tables’. The i-th ‘customer’ entering the restaurant
chooses a table $\theta_{ji}$ for himself, conditioned on the seating arrangement of all previous
customers. He chooses the k-th table with probability proportional to $n_{jk}$, the number of
people already seated at the table, and with probability proportional to $\alpha$, he chooses a new
(currently unoccupied) table. Whenever a new table is chosen, a new ‘dish’ $\phi_k$ is drawn
($\phi_k \sim H$) and associated with the table. The CRP readily lends itself to sampling-based
inference strategies for the DP.
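The seating process of the restaurant analogy can be simulated directly. The following is a minimal sketch of the prior only (no data likelihood is involved); the function name and the chosen values of the concentration parameter and number of customers are assumptions for illustration.

```python
import random


def crp_seating(num_customers, alpha, rng):
    """Seat customers sequentially; return each customer's table and the table sizes."""
    assignments = []
    counts = []  # counts[k] = n_jk, number of customers already at table k
    for _ in range(num_customers):
        # Existing table k has weight n_jk; a new table has weight alpha.
        # rng.choices normalises the weights, which matches Eqn. 4.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new table (a new dish would be drawn from H)
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments, counts


rng = random.Random(1)
assignments, counts = crp_seating(100, 1.5, rng)
print(f"{len(counts)} tables for 100 customers; sizes: {counts}")
```

Running the sketch with larger values of `alpha` tends to produce more tables, reflecting the role of the concentration parameter.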

Hierarchical Dirichlet Process: Now reconsider our grouped data setting. If each $G_j$ is
drawn independently from a DP, then w.p. 1 the atoms $\{\phi_{jk}\}_{k=1}^{\infty}$ for each $G_j$ are distinct
when the base distribution $H$ is continuous. This would mean that there is no shared
topic across documents, which is undesirable. The Hierarchical Dirichlet Process (HDP)
[Y. Teh and Blei (2006)] addresses this problem by modeling the base distribution of the
DP prior in turn as a draw $G_B$ from a DP, instead of the continuous distribution $H$. Since
draws from a DP are discrete, this ensures that the same atoms $\{\phi_k\}$ are shared across all
the $G_j$s. Specifically, given a distribution $H$ on the space $(\Theta, \mathcal{B})$ and positive real numbers
$\{\alpha_j\}$ and $\gamma$, we denote as $HDP(\{\alpha_j\}, \gamma, H)$ the following generative process:

$$G_B \mid \gamma, H \sim DP(\gamma, H)$$
$$G_j \mid \alpha_j, G_B \sim DP(\alpha_j, G_B), \quad \forall j. \qquad (5)$$

When the generation of $G_j$s as described in Eqn. 5 is followed by generation of $\{\theta_{ji}\}$ and
$\{x_{ji}\}$ as in Eqn. 1, we get the HDP mixture model.

Using the stick-breaking construction, the global measure $G_B$, distributed as a Dirichlet
process, can be expressed as $G_B = \sum_{k=1}^{\infty} \beta_k \delta_{\phi_k}$, where the topics $\phi_k$ as before are drawn from
$H$ independently ($\phi_k \sim H$) and the stick-breaking weights $\beta \sim GEM(\gamma)$ represent ‘global’
popularities of these topics. Since $G_B$ has as its support the topics $\{\phi_k\}$, each group-specific
distribution $G_j$ necessarily has support at these topics, and can be written as follows:

$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\phi_k}, \quad (\pi_{jk})_{k=1}^{\infty} \sim DP(\alpha_j, \beta) \qquad (6)$$

where $\pi_j = (\pi_{jk})_{k=1}^{\infty}$ denotes the topic popularities for the j-th group.
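To make the sharing of atoms concrete, the two sets of weights can be sketched under a finite truncation. Here the global weights are truncated to a fixed number of topics and renormalised, in which case a draw from $DP(\alpha_j, \beta)$ reduces to a Dirichlet draw with parameters $\alpha_j \beta_k$. The truncation level, the renormalisation, and all function names are assumptions of this sketch and are not part of the model definition.

```python
import random


def gem_weights(gamma, truncation, rng):
    """Truncated, renormalised stick-breaking weights beta ~ GEM(gamma)."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        w = rng.betavariate(1.0, gamma)
        weights.append(w * remaining)
        remaining *= 1.0 - w
    total = sum(weights)
    return [b / total for b in weights]  # renormalise so the truncated weights sum to 1


def group_weights(alpha_j, beta, rng):
    """pi_j ~ Dirichlet(alpha_j * beta): the finite analogue of DP(alpha_j, beta)."""
    # A Dirichlet draw is obtained by normalising independent Gamma draws.
    draws = [rng.gammavariate(alpha_j * b, 1.0) if alpha_j * b > 0 else 0.0
             for b in beta]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(2)
beta = gem_weights(1.0, 20, rng)                        # global topic popularities
pis = [group_weights(3.0, beta, rng) for _ in range(4)]  # one pi_j per group
for j, pi in enumerate(pis):
    top = max(range(len(pi)), key=lambda k: pi[k])
    print(f"group {j}: most popular topic index {top}, weight {pi[top]:.3f}")
```

Every group places its mass on the same indexed set of topics, which is the sharing property that independent DPs with a continuous base distribution lack.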

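Analogously to the CRP for the DP, the Chinese Restaurant Franchise (CRF) provides an
interpretation of the predictive distribution for the next draw from an HDP after integrating out
the $G_j$s and $G_B$. Let $\{\theta_{ji}\}$ denote the sequence of draws from each $G_j$, $\{\psi_{jt}\}$ the sequence
of draws from $G_B$, and $\{\phi_k\}$ the sequence of draws from $H$. Then the conditional
distribution of $\theta_{ji}$ given $\theta_{j1}, \ldots, \theta_{j,i-1}$ and $G_B$, after integrating out $G_j$, is as follows (similar
to that in Eqn. 4):

$$\theta_{ji} \mid \theta_{j1}, \ldots, \theta_{j,i-1}, \alpha, G_B \sim \sum_{t=1}^{m_{j\cdot}} \frac{n_{jt\cdot}}{i-1+\alpha}\, \delta_{\psi_{jt}} + \frac{\alpha}{n_{j\cdot\cdot}+\alpha}\, G_B \qquad (7)$$

where $n_{jtk} = \sum_{i'=1}^{i-1} \delta(\theta_{ji'}, \psi_{jt})\, \delta(\psi_{jt}, \phi_k)$, $m_{jk} = \sum_{t} \delta(\psi_{jt}, \phi_k)$ and dots indicate marginal
counts. As $G_B$ is also distributed according to a Dirichlet Process, we can integrate it out
similarly to get the conditional distribution of $\psi_{jt}$:

$$\psi_{jt} \mid \psi_{11}, \psi_{12}, \ldots, \psi_{21}, \ldots, \psi_{j,t-1}, \gamma, H \sim \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot}+\gamma}\, \delta_{\phi_k} + \frac{\gamma}{m_{\cdot\cdot}+\gamma}\, H \qquad (8)$$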

These equations may be interpreted using a restaurant analogy with tables and dishes.
Consider a set of restaurants, one corresponding to each group. Customers entering each of
the restaurants select a table $\theta_{ji}$ according to a group specific CRP (Eqn. 7). The restaurants
share a common menu of dishes $\{\phi_k\}$. Dishes are assigned to the tables of each restaurant
according to another CRP (Eqn. 8). Let $t_{ji}$ be the (table) index of the element of $\{\psi_{jt}\}$
associated with $\theta_{ji}$, and let $k_{jt}$ be the (dish) index of the element of $\{\phi_k\}$ associated with
$\psi_{jt}$. Then the two conditional distributions above can also be written in terms of the indexes
$\{t_{ji}\}$ and $\{k_{jt}\}$ instead of referring to the distributions directly. If we draw $\psi_{jt}$ via choosing
a summation term, we set $\psi_{jt} = \phi_k$ and let $k_{jt} = k$ for the chosen $k$. If the second term is
chosen, we increment $K$ by 1, draw $\phi_K \sim H$, and set $\psi_{jt} = \phi_K$ and $k_{jt} = K$. This CRF
analogy leads to efficient Gibbs sampling-based inference strategies for the HDP mixture
model [Y. Teh and Blei (2006)].
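The franchise analogy can be simulated from the prior in terms of the indexes $t_{ji}$ and $k_{jt}$ alone. The sketch below is a forward simulation only, not the Gibbs sampler; the function name, the dictionary layout, and the use of a single $\alpha$ for all restaurants are assumptions made for brevity.

```python
import random


def simulate_crf(customers_per_restaurant, alpha, gamma, rng):
    """Simulate table indexes t_ji and dish indexes k_jt from the CRF prior."""
    dish_tables = []  # dish_tables[k] = m_.k, tables across all restaurants serving dish k
    franchise = []
    for n_customers in customers_per_restaurant:
        table_counts = []    # n_jt., customers at table t of this restaurant
        table_dish = []      # k_jt, dish served at table t
        customer_table = []  # t_ji, table chosen by customer i
        for _ in range(n_customers):
            # Existing table t with weight n_jt., new table with weight alpha.
            t = rng.choices(range(len(table_counts) + 1),
                            weights=table_counts + [alpha])[0]
            if t == len(table_counts):
                # New table: pick its dish, existing dish k with weight m_.k,
                # new dish with weight gamma.
                k = rng.choices(range(len(dish_tables) + 1),
                                weights=dish_tables + [gamma])[0]
                if k == len(dish_tables):
                    dish_tables.append(0)  # a new dish phi_K would be drawn from H
                dish_tables[k] += 1
                table_counts.append(0)
                table_dish.append(k)
            table_counts[t] += 1
            customer_table.append(t)
        franchise.append({"t": customer_table, "k": table_dish, "n": table_counts})
    return franchise, dish_tables


rng = random.Random(3)
franchise, dish_tables = simulate_crf([30, 30, 30], alpha=1.0, gamma=1.0, rng=rng)
for j, restaurant in enumerate(franchise):
    print(f"restaurant {j}: dishes on tables {restaurant['k']}")
```

Because dishes are chosen from a menu common to all restaurants, the same dish index typically appears in several restaurants.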

Nested Dirichlet Process: In other applications of grouped data, we may want to
cluster observations in each group by learning group specific mixture distributions and
simultaneously cluster these group specific distributions, inducing a clustering over the groups
themselves. For example, when analyzing patient records in multiple hospitals, we may want
to cluster the patients in each hospital by learning a distribution over patient profiles and
cluster hospitals having the same distribution over patient profiles. The HDP cannot do
this, since each group specific mixture $G_j$ is distinct.

This problem is addressed by the nested Dirichlet Process [A. Rodriguez and Gelfand
(2008)], which first defines a set of distributions with an infinite support:

$$G^0_r = \sum_{k=1}^{\infty} \beta^0_{rk}\, \delta_{\phi^0_{rk}}, \quad \phi^0_{rk} \sim H, \quad \{\beta^0_{rk}\}_{k=1}^{\infty} \sim GEM(\gamma^0) \qquad (9)$$

and then draws the group specific distributions, that we now term as $G^1_j$, from a mixture
over this set of $\{G^0_r\}$:

$$G^1_j \sim G^1_B, \quad G^1_B = \sum_{r=1}^{\infty} \beta^1_r\, \delta_{G^0_r}, \quad \{\beta^1_r\} \sim GEM(\gamma^1)$$

We denote the generation process as $\{G^1_j\} \sim nDP(\gamma^1, \gamma^0, H)$. The process ensures non-zero
probability of different groups selecting the same $G^0_r$, leading to clustering of the groups
themselves. Using Eqn. 3, the draws $\{G^1_j\}$ can be characterized as:

$$G^1_j \sim G^1_B, \quad G^1_B \sim DP(\gamma^1, DP(\gamma^0, H)) \qquad (10)$$

where the base distribution of the outer DP is in turn another DP, unlike the HDP where it is
DP distributed. Thus the nDP can be viewed as a distribution on the space of distributions
on distributions.
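A truncated sketch of the nDP construction shows both of its characteristic properties: groups that pick the same inner distribution are clustered together, and distinct inner distributions share no atoms when the base distribution is continuous. The truncation levels, the Gaussian base distribution, and the function names are assumptions of this sketch.

```python
import random


def gem(concentration, truncation, rng):
    """Truncated, renormalised stick-breaking weights."""
    weights, remaining = [], 1.0
    for _ in range(truncation):
        w = rng.betavariate(1.0, concentration)
        weights.append(w * remaining)
        remaining *= 1.0 - w
    total = sum(weights)
    return [b / total for b in weights]


def sample_ndp(num_groups, gamma1, gamma0, base_sampler,
               trunc_outer, trunc_inner, rng):
    # Level 0: each inner distribution has its own weights and its own atoms from H.
    inner = []
    for _ in range(trunc_outer):
        weights = gem(gamma0, trunc_inner, rng)
        atoms = [base_sampler(rng) for _ in range(trunc_inner)]
        inner.append((atoms, weights))
    # Level 1: a mixture over the inner distributions with GEM(gamma1) weights.
    beta1 = gem(gamma1, trunc_outer, rng)
    # Each group selects exactly one inner distribution.
    choice = rng.choices(range(trunc_outer), weights=beta1, k=num_groups)
    return inner, beta1, choice


rng = random.Random(4)
inner, beta1, choice = sample_ndp(
    num_groups=8, gamma1=1.0, gamma0=1.0,
    base_sampler=lambda r: r.gauss(0.0, 1.0),  # assumed continuous base distribution H
    trunc_outer=10, trunc_inner=15, rng=rng)
print("inner distribution chosen by each group:", choice)
```

Groups printed with the same index share an entire distribution, whereas the atom lists of two different inner distributions have no element in common.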

The nDP can be expressed with the following restaurant analogy with two levels of
restaurants. Each group (hospital/document) is associated with an ‘outer’ level 1
restaurant, while each distribution $G^0_r$ corresponds to an ‘inner’ level 0 restaurant. Each outer
restaurant picks a distribution $G^1_j$ by picking a ‘dish’ from a global menu of dishes
across outer restaurants, based on the dish's popularity according to $G^1_B$. Each dish in this
menu, which corresponds to a unique inner restaurant, defines a specific distribution over
patient profiles. Hence each outer restaurant gets a distribution corresponding to one of
the inner restaurants through this process, leading to a grouping of the outer restaurants
(hospitals) based on the inner restaurant (distribution over patient profiles) chosen. The
customer entering an outer restaurant $j$ goes to the corresponding inner restaurant, with
index $r$, such that $G^1_j = G^0_r$. Now the customer selects a table in this restaurant, with the
index, say, $k$. The data is generated from the corresponding atom $\phi^0_{rk}$.

A Note on Notation: The nDP brings to focus the idea of nesting, where the distributions
at one level ($\{G^0_r\}$ at level 0) are themselves atoms for the next level (level 1 mixture
distribution $G^1_B$). Hence, with the nDP, we introduce the notion of levels into our notation
through superscripts for random variables. For the rest of the paper, the superscript of a
random variable indicates the level of the variable. Table ?? shows a ready summary of the
notation used through the rest of the paper.

Nested Chinese Restaurant Process: The nDP can be viewed as a tool for building
a non-parametric mixture of mixtures. The Nested Chinese Restaurant Process (nCRP)
[D. Blei and Tanenbaum (2010)] is a closely related model for multi-level clustering. The
nCRP extends the CRP by creating an infinitely-branched tree structure over restaurants to
define a distribution over finite length paths of trees for modeling topic hierarchies from
documents. The nCRP can be interpreted with a restaurant analogy consisting of multiple
levels of restaurants, as described in [D. Blei and Tanenbaum (2010)]: “A tourist
arrives at the city for a culinary vacation. On the first evening, he enters the root Chinese
restaurant and selects a dish using the CRP distribution, based on its popularity (equation
4). On the second evening, he goes to the restaurant identified on the first night's dish and
chooses a second dish using a CRP distribution based on the popularity of the dishes in the
second night's restaurant. He repeats this process forever.” The nCRP is closely
connected to the nDP, since a K-level nCRP can be obtained by integrating out the DP at
each level in a K-level nDP, facilitating multi-level non-parametric mixture models.

Multi-level Admixture models: The nDP enables modeling a non-parametric
mixture of non-parametric mixtures, while the nCRP provides a hierarchical prior for multi-level
non-parametric mixture models. In other words, the multi-level nDP leads to a prior where
each distribution at a specific level $l$ is a mixture over a distinct set of distributions at the
previous level $l-1$. Hence, there are no atoms in common between distributions at each
level. The nDP and multi-level nDP are therefore not suited for applications that require
mixture components to be shared across group specific distributions at each level.
Several real world scenarios are however more effectively modeled by multi-level admixture
models, where each level has a group of distributions which share mixture components.

An example of entity-topic modeling for document collections clearly illustrates the
limitation of existing models. Here, we would like to model documents as having distributions
over a set of latent entities, with multiple documents sharing entities. We would like to
model the entities themselves as distributions over a set of latent topics, with the ability for
multiple entities to share topics. This constitutes a two level admixture model, where group
specific distributions at one level (the ‘entity’ distributions over topics) must share atoms
(topics), which are themselves distributions at the previous level (the ‘topic’ distributions
over words).

The author-topic model (ATM) [M. Rosen-Zvi and Smyth (2004)], an extension of
LDA, captures this modeling scenario for the parametric case where the entities (authors)
for each document are observed and the number of topics is known in advance. Consider
a corpus containing a set $A$ of authors. The ATM captures the known authors $A_j \subseteq A$ of each
document by modeling documents as uniform distributions $\{G^1_j\}$ over the corresponding sets of
authors $A_j$, and authors as distributions $\{G^0_r\}$ over $K$ topics. The words are generated by
first sampling one of the known authors of the document (with $\theta^1_{ji}$ holding the global
index of this author), followed by sampling a topic from the topic distribution of that
author:

$$\theta^1_{ji} \mid G^1_j \sim G^1_j; \quad \theta^0_{ji} \mid \theta^1_{ji} = r, G^0_r \sim G^0_r; \quad x_{ji} \mid \theta^0_{ji} \sim F(\theta^0_{ji}), \quad \forall j, i \qquad (11)$$
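The generative process of Eqn. 11 can be sketched for a single document as follows. The toy author names, topic distributions and vocabulary size are hypothetical values chosen only to make the sketch runnable; the ATM would learn these distributions from data, and the priors placed on them are not shown.

```python
import random


def generate_atm_document(author_set, author_topics, topic_words, num_words, rng):
    """Generate (author, topic, word) triples for one document under the ATM."""
    words = []
    for _ in range(num_words):
        # The document is a uniform distribution over its known authors A_j.
        author = rng.choice(author_set)
        # The author is a distribution over the K topics.
        topic_probs = author_topics[author]
        topic = rng.choices(range(len(topic_probs)), weights=topic_probs)[0]
        # The topic is a distribution over the vocabulary.
        word_probs = topic_words[topic]
        word = rng.choices(range(len(word_probs)), weights=word_probs)[0]
        words.append((author, topic, word))
    return words


# Hypothetical toy parameters: 3 authors, K = 2 topics, 4 vocabulary words.
author_topics = {"a1": [0.9, 0.1], "a2": [0.2, 0.8], "a3": [0.5, 0.5]}
topic_words = [[0.4, 0.4, 0.1, 0.1], [0.1, 0.1, 0.4, 0.4]]
doc = generate_atm_document(["a1", "a2"], author_topics, topic_words,
                            12, random.Random(5))
print(doc)
```

Both the author set of the document and the number of topics are fixed inputs here, which are precisely the two restrictions that a non-parametric treatment aims to lift.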


The ATM however cannot handle a more realistic scenario of non-parametric modeling 
where the number of topics is not fixed in advance and author set for each document is not 
fully observed. Such an application calls for multi-level non-parametric admixture 
modeling, a previously unexplored problem. Motivated by this, we propose the nested 
Hierarchical Dirichlet Process (nHDP) for multi-level non-parametric admixture modeling. 


4. Nested Hierarchical Dirichlet Processes 

In this section, we introduce the Nested Hierarchical Dirichlet Processes. For this, we first
introduce the 2-nHDP, i.e. the two level nested HDP for non-parametric modeling of entities
and topics, and then generalize this to the L-nHDP for any given number of levels L.


4.1 Two-level Nonparametric Admixture Model 


Recall that in [M. Rosen-Zvi and Smyth (2004)], the authors approach the problem of
modeling the topics and entities for the application of author-topic modeling by taking a
two level approach. Our aim is to build a 2-level admixture model, the 2-nHDP, for a non-parametric
treatment of this problem. However, before this, we first present a simpler intermediate
model which we call DP-HDP, an extension of the nDP, for ungrouped data, where the words
are not grouped into documents, leading to a mixture of admixtures model. (This can
also be interpreted as a use case for single document analysis instead of a collection of
documents.) We then extend it to grouped data (multiple documents) to build the
2-nHDP, which models non-parametric admixtures of admixtures. We finally generalize this to the
(L+1)-nHDP later in this section.


DP-HDP for Ungrouped Data: Consider an entity-topic modeling scenario where the
observed data, i.e. the set of words, is not grouped into documents. One could conceive of
performing two-level modeling of such data with the nDP. In the nDP, entities are the
distributions of Equation [9], with $\phi$ as the topic variables drawn from a base distribution $H$.
However, the nDP is unsuitable for such analysis, since the entities drawn from a DP with a
continuous base distribution $H$ do not share topic atoms. This can be remedied by first creating
a set of entities $\{G^0_r\}$ such that they share topics. One way to do this is to follow the HDP
construction for entities:


$$G^0_r \sim \mathrm{HDP}(\{\alpha^0\}, \gamma^0, H), \quad r = 1 \ldots \infty \qquad (12)$$

This can be followed by drawing the entity for each word $i$ from a mixture over the $G^0_r$s:

$$G^1_i \sim \sum_{r=1}^{\infty} \beta^1_r \delta_{G^0_r}, \quad \beta^1 \sim \mathrm{GEM}(\gamma^1) \qquad (13)$$

This may be interpreted as creating a countable set of entities $\{G^0_r\}$ by defining topic
preferences (distributions over topics) for each of them, and then defining a 'global popularity'
of the entities. Using Eqn. ?? we observe that $G^1_i \sim \mathrm{DP}(\gamma^1, \mathrm{HDP}(\{\alpha^0\}, \gamma^0, H))$.
Observe the relationship with the nDP (Eqn. [10]). Like the nDP, this also defines a distribution
over the space of distributions on distributions. But, instead of a DP base distribution for
the outer DP, we have achieved sharing of topics using an HDP base distribution. We will
write $G^1_i \sim \text{DP-HDP}(\gamma^1, \{\alpha^0\}, \gamma^0, H)$.
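The GEM weights that define the global popularity of entities can be illustrated with a truncated stick-breaking draw. The following is a minimal sketch rather than the authors' implementation; the function name `gem_stick_breaking`, the truncation level, and the value of the concentration parameter are assumptions made only for illustration.

```python
import random


def gem_stick_breaking(gamma, truncation, rng):
    """Return the first `truncation` GEM(gamma) weights and the leftover stick mass."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(truncation):
        v = rng.betavariate(1.0, gamma)  # fraction of the remaining stick to break off
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return weights, remaining


rng = random.Random(0)
beta, leftover = gem_stick_breaking(gamma=2.0, truncation=50, rng=rng)
```

The returned weights play the role of the first 50 entries of $\beta^1$; the leftover mass is the total weight of all entities beyond the truncation.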

Note that multiple words can choose the same entity. As before, the entity $G^1_i$ can now be
used as a prior for sampling topics, say $\{\theta_i\}$, for the individual words which chose that entity,
using

$$\theta_i \sim G^1_i, \quad x_i \sim F(\theta_i) \qquad (14)$$

We will call this the DP-HDP mixture model. Note that one can alternatively use this
model for grouped data where each group or document is associated with a single entity
and each word in the document chooses a topic according to the entity's distribution over topics.

2-nHDP for Grouped Data: We now extend the approach presented in § ?? to the setting of
grouped data, since most applications use multiple documents, e.g. news articles, scientific
literature, images, etc. In the single entity model, since a document is associated with one
entity, a single entity is sampled for all the words in the document. Now, in the case of
multiple entities per document, we first sample an entity for each word in the document, and
then sample a topic according to the entity-specific distribution over topics.

As in the previous model, we first create a set of entities $\{G^0_r\}$ as distributions over a
common set of topics $\{\phi_k \sim H\}$ by drawing independently from an HDP (Equation
[12]), and then create a global mixture $G^1_B$ over these entities (Equation [13]).

Earlier, in the absence of groupings, this global popularity was used to sample entities
for all the words. Now, for each document $j$, we define a local popularity of entities, derived
from their global popularity $G^1_B$:


$$G^1_j = \sum_{r=1}^{\infty} \pi^1_{jr} \delta_{G^0_r}, \quad \{\pi^1_{jr}\} \sim \mathrm{DP}(\alpha^1_j, \beta^1) \qquad (15)$$

Now, sampling each factor $\theta^0_{ji}$ in group $j$ is preceded by choosing an entity $\theta^1_{ji} \sim G^1_j$ by
sampling according to the local entity popularity $G^1_j$. Note that $P(\theta^1_{ji} = G^0_r) = \pi^1_{jr}$.

Note that equation [15] above is similar to the stick-breaking definition of the HDP in
Equation ??. We can see that $G^1_j$ is drawn from an HDP whose base distribution is over the atoms
$\{G^0_r\}$ instead of the topics $\{\phi_k\}$. This distribution over $\{G^0_r\}$ is again an HDP. Therefore, we
can write:

$$\theta^1_{ji} \sim G^1_j \sim \mathrm{HDP}(\{\alpha^1_j\}, \gamma^1, \mathrm{HDP}(\{\alpha^0\}, \gamma^0, H)) \qquad (16)$$

We refer to the two HDPs as the inner and outer HDPs and hence call this the 2-nHDP. We
can write $\theta^1_{ji} \sim \text{2-nHDP}(\{\alpha^1_j\}, \gamma^1, \{\alpha^0\}, \gamma^0, H)$. Similar to the nDP and the DP-HDP (Eqn.
??), this again defines a distribution over the space of distributions over distributions. The
2-nHDP mixture model is completely defined by subsequently sampling $\theta^0_{ji} \sim \theta^1_{ji}$, followed
by $x_{ji} \sim F(\theta^0_{ji})$.

An alternative characterization of the 2-nHDP mixture model uses the topic index $z^0_{ji}$
and the entity index $z^1_{ji}$ corresponding to $x_{ji}$:

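$$\begin{aligned}
\beta^0 &\sim \mathrm{GEM}(\gamma^0); \quad \pi^0_r \sim \mathrm{DP}(\alpha^0, \beta^0); \quad \phi_k \sim H, \quad k, r = 1 \ldots \infty \\
\beta^1 &\sim \mathrm{GEM}(\gamma^1); \quad \pi^1_j \sim \mathrm{DP}(\alpha^1_j, \beta^1), \quad j = 1 \ldots M \\
z^1_{ji} &\sim \pi^1_j; \quad z^0_{ji} \sim \pi^0_{z^1_{ji}}; \quad x_{ji} \sim F(\phi_{z^0_{ji}}), \quad i = 1 \ldots n_j \qquad (17)
\end{aligned}$$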

This may be understood as first creating entity-specific distributions $\pi^0_r$ over topics
using the global topic popularities $\beta^0$, followed by the creation of document-specific distributions
$\pi^1_j$ over entities using the global entity popularities $\beta^1$. Using these parameters, the content
of the $j$-th document is generated by repeatedly sampling, in iid fashion, an entity index $z^1_{ji}$
using $\pi^1_j$, a topic index $z^0_{ji}$ using $\pi^0_{z^1_{ji}}$, and finally a word using $F(\phi_{z^0_{ji}})$.

Observe the connection with the ATM in Eqn. [11]. The main difference is that the set
of entities and topics is infinite. Separately, each document now has a distinct non-uniform
distribution $\pi^1_j$ over entities.
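The index-based generative process of the 2-nHDP can be sketched with a finite truncation. This is an illustrative approximation under stated assumptions, not the authors' code: the infinite sets of topics and entities are truncated to `num_topics` and `num_entities`, each DP over a finite weight vector is approximated by a Dirichlet draw, $H$ is taken to be a symmetric Dirichlet over a vocabulary, and $F$ is a categorical distribution. All function and parameter names are hypothetical.

```python
import random


def stick_weights(gamma, size, rng):
    """Truncated GEM(gamma) weights, renormalised to sum to one (an approximation)."""
    weights, remaining = [], 1.0
    for _ in range(size):
        v = rng.betavariate(1.0, gamma)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    total = sum(weights)
    return [w / total for w in weights]


def dirichlet(shapes, rng):
    """Dirichlet draw via normalised gamma variates; tiny shapes are floored."""
    draws = [rng.gammavariate(max(s, 1e-12), 1.0) for s in shapes]
    total = sum(draws)
    return [d / total for d in draws]


def categorical(probs, rng):
    """Index drawn with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_2nhdp_corpus(num_docs, doc_len, num_topics, num_entities,
                          vocab_size, gamma0, alpha0, gamma1, alpha1, eta, rng):
    # phi_k ~ H, with H assumed to be a symmetric Dirichlet(eta) over the vocabulary
    phi = [dirichlet([eta] * vocab_size, rng) for _ in range(num_topics)]
    # beta^0 ~ GEM(gamma^0), then pi^0_r ~ DP(alpha^0, beta^0) for every entity r
    beta0 = stick_weights(gamma0, num_topics, rng)
    pi0 = [dirichlet([alpha0 * b for b in beta0], rng) for _ in range(num_entities)]
    # beta^1 ~ GEM(gamma^1): global popularity of entities
    beta1 = stick_weights(gamma1, num_entities, rng)
    corpus = []
    for _ in range(num_docs):
        # pi^1_j ~ DP(alpha^1_j, beta^1): document-specific popularity of entities
        pi1 = dirichlet([alpha1 * b for b in beta1], rng)
        doc = []
        for _ in range(doc_len):
            entity = categorical(pi1, rng)         # z^1_ji
            topic = categorical(pi0[entity], rng)  # z^0_ji
            word = categorical(phi[topic], rng)    # x_ji
            doc.append((entity, topic, word))
        corpus.append(doc)
    return corpus


corpus = generate_2nhdp_corpus(3, 20, 8, 5, 30, 1.0, 1.0, 1.0, 1.0, 0.5,
                               random.Random(1))
```

Each generated token is a triple (entity index, topic index, word index), which makes the two levels of latent assignment behind every word explicit.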


Also, observe that we have preserved the HDP notation to the extent possible, to facilitate
understanding. To distinguish between variables corresponding to the two HDP levels
in this model, we use the superscript 0 for symbols corresponding to the inner HDP,
which models entities as distributions over topics, and the superscript 1 for symbols
corresponding to the outer HDP, which models documents as distributions over entities. Going
forward, we follow the same convention for naming variables in the multi-level HDP with
multiple levels of nesting.

4.2 Multi-level Non-parametric Admixture modeling 

We now present the $(L+1)$-nHDP, a generalization of the 2-nHDP proposed in
Section 4.1 that can be used for multi-level non-parametric admixture modeling.

The 2-nHDP was constructed by first creating a set of entities $\{G^0_r\}$, by drawing
each of these distributions from an inner HDP with base distribution $H$. This is followed
by drawing document-specific distributions $\{G^1_j\}$ at the outermost level from the outer
HDP, with the inner HDP as the base distribution. To extend this to multiple levels, at
each level we draw group-level distributions from an HDP whose base distribution is
the previous level's HDP.

Let $L+1$ denote the number of levels of nesting, indexed by $l \in \{0, \ldots, L\}$. Throughout
the rest of this section, the superscript of a random variable denotes its level.
The nested HDP comprises multiple levels of HDPs, where the base distribution
of the HDP at level $l$ is the HDP at level $l-1$. The innermost level is 0 while the outermost
level is $L$. The groups in the outermost level $L$ correspond to documents in the case of
entity-topic modeling. At the innermost level 0, we have an HDP with a base distribution
$H$ from which the innermost level entities are drawn. In the case of entity-topic modeling,
these innermost entities are topics, which are modeled as distributions over words.

At level 0, the innermost level, we draw level-1 entities $\{G^0_r\}$ from an HDP with
base distribution $H$. This step corresponds to equation [12] of the 2-nHDP and constitutes a
non-parametric admixture over atoms drawn from $H$. Note that in the case of the two-level model,
we had termed $\{G^0_r\}$ entities. In this multi-level model, we term these
level-1 entities, and topics can be considered level-0 entities. Hence, at level 0, we have

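$$G^0_r \sim \mathrm{HDP}(\{\alpha^0\}, \gamma^0, H) = H^0, \quad r = 1, \ldots \qquad (18)$$

Alternately expressed as,

$$\beta^0 \sim \mathrm{GEM}(\gamma^0); \quad G^0_B = \sum_{k=1}^{\infty} \beta^0_k \delta_{\phi_k}; \quad G^0_r = \sum_{k=1}^{\infty} \pi^0_{rk} \delta_{\phi_k} \ \text{where} \ \{\pi^0_{rk}\} \sim \mathrm{DP}(\alpha^0, \beta^0), \quad r = 1, \ldots$$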

We denote the HDP distribution itself at level 0 by $H^0$, which subsequently becomes the
base distribution for the next level HDP. At any level $l \in \{1, 2, \ldots, L-1\}$, $H^{l-1}$ becomes the
base distribution of the level-$l$ HDP, while the group-level distributions at the previous
level, $G^{l-1}_k$, $k \in 1, 2, \ldots$, become the atoms for the group-level distributions that we construct
at level $l$, $G^l_r$, $r = 1, 2, \ldots$,

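$$G^l_r \sim \mathrm{HDP}(\{\alpha^l\}, \gamma^l, H^{l-1}) = H^l, \quad r = 1, \ldots \qquad (19)$$

Alternately expressed as,

$$\beta^l \sim \mathrm{GEM}(\gamma^l); \quad G^l_B = \sum_{k=1}^{\infty} \beta^l_k \delta_{G^{l-1}_k}; \quad G^l_r = \sum_{k=1}^{\infty} \pi^l_{rk} \delta_{G^{l-1}_k} \ \text{where} \ \{\pi^l_{rk}\} \sim \mathrm{DP}(\alpha^l, \beta^l), \quad r = 1, \ldots$$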

For the HDP at the outermost level $L$, the base distribution is $H^{L-1}$, the HDP from the
previous level. At this level we have a set of $M$ groups, which correspond to the
documents in the case of document modeling. While it is possible to develop a multi-level
admixture model where the number of groups is unobserved at every level, in this paper
we assume the number of groups at the outermost level to be an observed quantity, in
line with the document modeling use case. Hence, at level $L$, we have

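$$G^L_j \sim \mathrm{HDP}(\{\alpha^L_j\}, \gamma^L, H^{L-1}) = H^L, \quad j = 1, \ldots, M \qquad (20)$$

Alternately expressed as,

$$\beta^L \sim \mathrm{GEM}(\gamma^L); \quad G^L_B = \sum_{k=1}^{\infty} \beta^L_k \delta_{G^{L-1}_k}; \quad G^L_j = \sum_{k=1}^{\infty} \pi^L_{jk} \delta_{G^{L-1}_k} \ \text{where} \ \{\pi^L_{jk}\} \sim \mathrm{DP}(\alpha^L_j, \beta^L), \quad j = 1, \ldots, M$$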

Each observed data item $i \in 1, \ldots, N_j$ that resides in one of the outermost groups $j \in
\{1, \ldots, M\}$ is now associated with an entity (group-level distribution) from each HDP level
$l$, which itself is a distribution over entities drawn from the previous level $l-1$ HDP. Hence
we generate the data as follows. First generate $\theta^L_{ji} \sim G^L_j$ from the group-level distribution
of the outermost group $j$. For any level $l \in \{L-1, \ldots, 0\}$, we select $\theta^l_{ji} \sim \theta^{l+1}_{ji}$. Note that
each $\theta^{l+1}_{ji}$ sampled is equal to one of the variables $\{G^l_r\}$ (which are themselves
distributions over atoms drawn from the previous level HDP). $\theta^0_{ji}$ is equal to one of
$\{\phi_k\}_{k=1}^{\infty}$ at the innermost level zero. Finally, data items are generated as $x_{ji} \sim F(\theta^0_{ji})$.
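Similar to the 2-nHDP, the $(L+1)$-nHDP can be defined using the index $z^l_{ji}$ of the atom
at each level $l$ corresponding to data item $x_{ji}$ as follows.

$$\begin{aligned}
\beta^0 &\sim \mathrm{GEM}(\gamma^0); \quad \pi^0_r \sim \mathrm{DP}(\alpha^0, \beta^0); \quad \phi_k \sim H, \quad k, r = 1 \ldots \infty \\
\beta^l &\sim \mathrm{GEM}(\gamma^l); \quad \pi^l_r \sim \mathrm{DP}(\alpha^l, \beta^l), \quad r = 1 \ldots \infty, \ l = 1 \ldots L-1 \\
\beta^L &\sim \mathrm{GEM}(\gamma^L); \quad \pi^L_j \sim \mathrm{DP}(\alpha^L_j, \beta^L), \quad j = 1 \ldots M \\
z^L_{ji} &\sim \pi^L_j; \quad z^l_{ji} \sim \pi^l_{z^{l+1}_{ji}}; \quad x_{ji} \sim F(\phi_{z^0_{ji}}), \quad i = 1 \ldots n_j, \ l = 0 \ldots L-1 \qquad (21)
\end{aligned}$$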
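The last line of the index-based definition describes a walk from the outermost level down to level 0, where the index chosen at one level selects the weight vector used at the next. A minimal sketch of that walk follows, assuming the weight vectors have already been drawn and truncated to finite length; the function name `sample_index_path` and the nested-list layout of the weights are hypothetical.

```python
import random


def sample_index_path(doc_weights, level_weights, rng):
    """Draw [z^L, z^(L-1), ..., z^0] for one word.

    doc_weights      -- pi^L_j, the document's weights over level L dishes
    level_weights[l] -- list of weight vectors pi^l_r for l = 0 .. L-1,
                        one vector per restaurant r of level l
    """
    z = rng.choices(range(len(doc_weights)), weights=doc_weights, k=1)[0]
    path = [z]
    # The dish index chosen at level l+1 names the restaurant entered at level l.
    for level in range(len(level_weights) - 1, -1, -1):
        weights = level_weights[level][z]
        z = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        path.append(z)
    return path


# Degenerate (one-hot) weights make the walk deterministic for illustration.
doc_weights = [0.0, 1.0]
level_weights = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # level 0: three restaurants over two topics
    [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],    # level 1: two restaurants over three dishes
]
path = sample_index_path(doc_weights, level_weights, random.Random(0))
```

With these one-hot weights the walk visits level 2 dish 1, then level 1 dish 2, then topic 0.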


4.3 Nested Chinese Restaurant Franchise 

In this section, we derive the predictive distribution for the next draw $\theta^l_{ji}$ at various levels
from the nHDP given previous draws, after integrating out the various group-level distributions
$\{G^l_r\}$ and $G^l_B$ at each level. We also provide a restaurant analogy for the nHDP in
terms of multiple levels of nested CRFs, corresponding to the multiple levels of HDP. This
will be useful for the inference algorithm that we describe in Section 5.

We start with the outermost level $L$. Let $\theta^L_{j1}, \theta^L_{j2}, \ldots$ denote the sequence of draws from
$G^L_j$, and $\psi^L_{j1}, \psi^L_{j2}, \ldots$ denote the sequence of draws from $G^L_B$. Then the conditional
distribution of $\theta^L_{ji}$ given all previous draws, after integrating out $G^L_j$, is as follows:


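$$\theta^L_{ji} \mid \theta^L_{j1}, \ldots, \theta^L_{j,i-1}, \alpha^L_j, G^L_B \sim \sum_{t=1}^{T^L_j} \frac{n^L_{jt}}{i - 1 + \alpha^L_j} \delta_{\psi^L_{jt}} + \frac{\alpha^L_j}{i - 1 + \alpha^L_j} G^L_B \qquad (22)$$

where $n^L_{jt}$ is the number of previous draws $\theta^L_{ji'}$, $i' < i$, that are equal to $\psi^L_{jt}$. Next, we
integrate out $G^L_B$, which is also distributed according to a Dirichlet process:

$$\psi^L_{jt} \mid \psi^L_{11}, \psi^L_{12}, \ldots, \psi^L_{21}, \ldots, \gamma^L, H^{L-1} \sim \sum_{k=1}^{K^L} \frac{m^L_{\cdot k}}{m^L_{\cdot\cdot} + \gamma^L} \delta_{G^{L-1}_k} + \frac{\gamma^L}{m^L_{\cdot\cdot} + \gamma^L} H^{L-1} \qquad (23)$$

with $m^L_{\cdot k}$ the number of draws $\psi^L_{jt}$ equal to $G^{L-1}_k$ and $m^L_{\cdot\cdot} = \sum_k m^L_{\cdot k}$.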


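Note that here $K^L$ refers to the number of unique atoms $G^{L-1}_k$ already drawn from
the base HDP $H^{L-1}$. Observe that each $\theta^L_{ji}$ variable gets assigned to one of the $G^{L-1}_k$
variables, from which $\theta^{L-1}_{ji}$ is drawn (recall $\theta^{L-1}_{ji} \sim \theta^L_{ji}$). Hence, the predictive distribution
for $\theta^{L-1}_{ji}$, given $\theta^L_{ji} = G^{L-1}_r$, is obtained by integrating out the corresponding group-level
distribution $G^{L-1}_r$. Similarly, for any general level $l$, given that $\theta^{l+1}_{ji} = G^l_r$, $\theta^l_{ji} \sim G^l_r$ is
drawn by integrating out the group-level distribution $G^l_r$. Hence, for level $l \in \{L-1, \ldots, 1\}$,
consider the sequence of previous draws from $G^l_r$, namely $\{\theta^l_{j'i'} : \theta^{l+1}_{j'i'} = G^l_r, \forall i', j' < j, \text{ and } i' < i, j' = j\}$.
Conditioned on these draws,

$$\theta^l_{ji} \mid \theta^{l+1}_{ji} = G^l_r, \{\theta^l_{j'i'}\}, \alpha^l, G^l_B \sim \sum_{t=1}^{T^l_r} \frac{n^l_{rt}}{n^l_{r\cdot} + \alpha^l} \delta_{\psi^l_{rt}} + \frac{\alpha^l}{n^l_{r\cdot} + \alpha^l} G^l_B \qquad (24)$$

where $n^l_{rt}$ is the number of times component $\psi^l_{rt}$ was picked and $n^l_{r\cdot} = \sum_t n^l_{rt}$. As $G^l_B$
is also distributed according to a Dirichlet Process, we can integrate it out similarly and
write the conditional distribution of $\psi^l_{rt}$ as follows, where $m^l_{\cdot k}$ is the number of draws
$\psi^l_{rt}$ equal to $G^{l-1}_k$, $m^l_{\cdot\cdot} = \sum_k m^l_{\cdot k}$, and $H^{l-1}$ is the previous level HDP:

$$\psi^l_{rt} \mid \psi^l_{11}, \psi^l_{12}, \ldots, \psi^l_{21}, \ldots, \gamma^l, H^{l-1} \sim \sum_{k=1}^{K^l} \frac{m^l_{\cdot k}}{m^l_{\cdot\cdot} + \gamma^l} \delta_{G^{l-1}_k} + \frac{\gamma^l}{m^l_{\cdot\cdot} + \gamma^l} H^{l-1} \qquad (25)$$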


At level 0, the predictive distribution for $\theta^0_{ji}$, given $\theta^1_{ji} = G^0_r$, can be obtained by integrating
out $G^0_r$, replacing $l$ with 0 in equation [24]. Similarly, the predictive distribution for $\{\psi^0_{rt}\}$,
draws from $G^0_B$, can be obtained by integrating out $G^0_B$ as follows.

$$\psi^0_{rt} \mid \psi^0_{11}, \psi^0_{12}, \ldots, \psi^0_{21}, \ldots, \gamma^0, H \sim \sum_{k=1}^{K^0} \frac{m^0_{\cdot k}}{m^0_{\cdot\cdot} + \gamma^0} \delta_{\phi_k} + \frac{\gamma^0}{m^0_{\cdot\cdot} + \gamma^0} H \qquad (26)$$


At this level, each $\theta^0_{ji}$ is assigned to one of the $\phi_k$, which are drawn from $H$, the base
distribution of the nHDP. Given the $\phi_k$ that corresponds to $\theta^0_{ji}$, the observed data is
generated from $F(\phi_k)$. Note that each of the conditional distributions for $\theta^l_{ji}$ and $\psi^l_{rt}$ is
similar to that for the CRF (Eqns. [7] and [8]). We interpret these distributions as a nested
Chinese Restaurant Franchise (nCRF), involving CRFs with multiple levels of nesting.
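Every conditional in this section has the same Chinese-restaurant shape: an existing component is chosen in proportion to its count, and a fresh draw from the base distribution is chosen in proportion to the concentration parameter. A minimal sketch of that shared form follows; the function name `crp_predictive` is hypothetical.

```python
def crp_predictive(counts, concentration):
    """Predictive probabilities of a CRP-style draw.

    Returns one probability per existing component (proportional to its count)
    followed by the probability of a new draw from the base distribution
    (proportional to the concentration parameter).
    """
    total = sum(counts) + concentration
    probs = [n / total for n in counts]
    probs.append(concentration / total)
    return probs


# Two occupied tables with 3 and 1 customers, concentration parameter 1.
probs = crp_predictive([3, 1], 1.0)
```

The same function covers the table choice within a restaurant (counts $n^l_{rt}$, concentration $\alpha^l$) and the dish choice from the global menu (counts $m^l_{\cdot k}$, concentration $\gamma^l$).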

We now describe in detail the restaurant analogy for the nested Chinese Restaurant
Franchise. The nCRF comprises multiple levels of CRFs. At each level $l$, there exist
multiple restaurants $\{G^l_r\}$, each containing a countably infinite number of tables. Each
table $t$ in restaurant $r$ of level $l$ is associated with a dish $\psi^l_{rt}$ from a global menu of dishes
specific to that level. $G^l_B$ is the distribution over the dishes in the global menu at level $l$,
modeling the global popularity of the dishes.

Imagine a customer on a culinary vacation. We trace the journey of this customer to
show the process of generating $x_{ji}$, the $i$-th word in the $j$-th document, through the dishes
he selects at restaurants at various levels. The customer first enters restaurant $j$ in
the outermost level $L$ as the $i$-th customer and chooses a table with index $t^L_{ji}$, based on
the popularity of the tables governed by $\pi^L_j$. Each table $t$ in this level $L$ restaurant $j$ is
associated with a dish $\psi^L_{jt}$ from a global menu at level $L$. Each of these dishes has a
one-to-one correspondence with a unique restaurant at level $L-1$, leading to nesting between
CRF levels. We use the variable $\theta^L_{ji} = \psi^L_{j t^L_{ji}}$ to denote the level $L$ dish thus chosen by
the customer through his table selection, $z^L_{ji}$ to denote the index of the dish within
the global menu, and $r^{L-1}_{ji}$ to denote the level $L-1$ restaurant corresponding to the dish
chosen. The customer now enters the restaurant $r^{L-1}_{ji}$ at level $L-1$ and repeats this process
by selecting a table based on the distribution $\pi^{L-1}_{r^{L-1}_{ji}}$.

At any intermediate level $l$, the customer enters the restaurant $r^l_{ji}$ governed by the dish
chosen at the previous level. He then selects a table $t^l_{ji}$. Each table $t$ in this restaurant has a
dish $\psi^l_{rt}$ from the global level $l$ menu, governing the dish $\theta^l_{ji}$ chosen by the customer. Each
dish $k$ in the global menu corresponds to a unique restaurant in the previous level. This process
continues until, at level 0, the customer enters restaurant $r^0_{ji}$ governed by the dish selected
in level 1. The customer then chooses a table which is associated with a dish $\psi^0_{rt} = \phi_k$,
say for some $k \in \{1, 2, \ldots\}$. The word $x_{ji} \sim F(\phi_k)$ is generated from the corresponding
innermost level dish (topic).

Notation                  Description
$l$                       Level index, indicated in a superscript
$r$                       Restaurant index
$j$                       Document index (used instead of $r$ as index of observed group/restaurant at outermost level $L$)
$i$                       Word (customer) index within document
$k$                       Dish index in various contexts
$K^l$                     Number of dishes in the global menu at level $l$
$T^l_r$                   Number of tables in restaurant $r$ of level $l$
$L$                       Index of the outermost level ($l \in \{0, \ldots, L\}$)
$x_{ji}$                  $i$-th word observed in document $j$
$t^l_{ji}$                Table index assigned to word $i$ of document $j$ for level $l$
$k^l_{rt}$                Dish index assigned to table $t$ of restaurant $r$ at level $l$
$z^l_{ji}$                Dish index at level $l$ assigned to word $i$ of document $j$
$r^l_{ji}$                Restaurant index at level $l$ (also level $l+1$ dish index) for word $i$ of document $j$
$\phi_k$ / $G^{l-1}_k$    $k$-th dish in the global menu at level $l$ ($\phi_k$ for $l = 0$, $G^{l-1}_k$ for $l \geq 1$)
$\psi^l_{rt}$             Dish assigned to table $t$ in restaurant $r$ at level $l$
$\theta^l_{ji}$           Dish assigned to word $i$ of document $j$ at level $l$
$n^l_{rt}$                Number of customers at table $t$ in restaurant $r$ in level $l$
$m^l_{rk}$                Number of tables in restaurant $r$ in level $l$ that got assigned dish $k$
$H$                       Base distribution of nHDP
$H^l$                     Base distribution of the HDP at level $l+1$: $H^l = \mathrm{HDP}(\alpha^l, \gamma^l, H^{l-1})$
$G^l_B$                   Base distribution at level $l$ for group level DP at level $l$
$G^l_r$                   $r$-th group level distribution at level $l$
$\alpha^l$                Concentration parameter of the group level DP at level $l$
$\gamma^l$                Concentration parameter of the base DP at level $l$

Table 1: Table describing notation used


4.4 Variations of multi-level nHDP 

Recall that at any given level $l \in \{1, 2, \ldots, L-1\}$ of the $(L+1)$-nHDP, the HDP distribution
$H^{l-1}$ of the previous level becomes the base distribution of the level-$l$ HDP, while the
group-level distributions at the previous level, $G^{l-1}_r$, $r \in 1, 2, \ldots$, become the atoms for the
group-level distributions at level $l$, $G^l_r$, $r = 1, 2, \ldots$. This leads to multi-level admixture
modeling where each entity at level $l$ models a distribution over entities at level $l-1$. However,
one can also consider a variation where entities at a given level are associated with a single
entity at the previous level, leading to a mixture instead of an admixture at this specific
level. In other words, we replace a given level's HDP with a DP to associate a single level-$l$
entity with the group at the next level. This leads to a multi-level model with admixtures at some
levels and mixtures at other levels. We note that the DP-HDP model (for grouped data), which
associates a single entity with each document (Section 4.1), is an instance of such a variation.
While these variations open avenues for investigating a new set of modeling techniques, we
restrict our work to multi-level admixture modeling. Inference in these models should be
an extension of that for our admixture model (see Section 5).


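[Figure 1, right panel (nHDP hierarchy):
Level 0: $G^0_B \sim \mathrm{DP}(\gamma^0, H)$; for $r \in \{1, \ldots\}$, $G^0_r \sim \mathrm{DP}(\alpha^0, G^0_B) = H^0$.
Level $l-1$: $G^{l-1}_B \sim \mathrm{DP}(\gamma^{l-1}, H^{l-2})$; for $r \in \{1, \ldots\}$, $G^{l-1}_r \sim \mathrm{DP}(\alpha^{l-1}, G^{l-1}_B) = H^{l-1}$.
Level $l$: $G^l_B \sim \mathrm{DP}(\gamma^l, H^{l-1})$; for $r \in \{1, \ldots\}$, $G^l_r \sim \mathrm{DP}(\alpha^l, G^l_B) = H^l$.
Level $L$: $G^L_B \sim \mathrm{DP}(\gamma^L, H^{L-1})$; for $j \in \{1, \ldots, M\}$, $G^L_j \sim \mathrm{DP}(\alpha^L, G^L_B)$.]

Figure 1: Pictorial representation of the Nested Chinese Restaurant Franchise (nCRF) on the
left and the corresponding nHDP on the right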


4.5 nHDP as Infinite Limit of Multi-level Finite Mixture Models

A Dirichlet process mixture model can be derived as the infinite limit of a finite mixture
model as the number of mixture components tends to infinity [Ishwaran and Zarepour, 2002].
In [Y. Teh and Blei (2006)], the authors have shown a similar result where an HDP can be
constructed as an infinite limit of a collection of finite mixture models. We show a similar
result for the nHDP as an infinite limit of multi-level finite mixture models.

We first define the following collection of finite mixtures. Consider a multi-level setting,
with $l \in 0, \ldots, L$ denoting the level, where each level has multiple group-level distributions
$G^l_{r|K^l}$, $r \in \{1, \ldots, K^{l+1}\}$, and a base-level distribution $G^l_{B|K^l}$. Note that we use the notation
$G^l_{r|K^l}$ to denote that the distribution has a finite number ($K^l$) of atoms. Further, these
group-level distributions at each level $l$ form the atoms of the next level $l+1$, defining


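multiple levels of finite mixtures as follows.

$$\begin{aligned}
&\beta^0 \mid \gamma^0 \sim \mathrm{Dir}\left(\tfrac{\gamma^0}{K^0}, \ldots, \tfrac{\gamma^0}{K^0}\right) \ \text{and} \ G^0_{B|K^0} = \sum_{k=1}^{K^0} \beta^0_k \delta_{\phi_k} \\
&\forall r \in \{1, \ldots, K^1\}: \ \pi^0_r \mid \alpha^0, \beta^0 \sim \mathrm{Dir}(\alpha^0 \beta^0) \ \text{and} \ G^0_{r|K^0} = \sum_{k=1}^{K^0} \pi^0_{rk} \delta_{\phi_k} \\
&\text{For each level } l \in \{1, \ldots, L\}: \\
&\beta^l \mid \gamma^l \sim \mathrm{Dir}\left(\tfrac{\gamma^l}{K^l}, \ldots, \tfrac{\gamma^l}{K^l}\right) \ \text{and} \ G^l_{B|K^l} = \sum_{k=1}^{K^l} \beta^l_k \delta_{G^{l-1}_{k|K^{l-1}}} \\
&\forall r \in \{1, \ldots, K^{l+1}\}: \ \pi^l_r \mid \alpha^l, \beta^l \sim \mathrm{Dir}(\alpha^l \beta^l) \ \text{and} \ G^l_{r|K^l} = \sum_{k=1}^{K^l} \pi^l_{rk} \delta_{G^{l-1}_{k|K^{l-1}}} \qquad (27)
\end{aligned}$$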


Theorem 1 For each $l \in \{0, \ldots, L\}$, with $G^l_{B|K^l}$ and $G^l_{r|K^l}$ defined as above, as $K^l \to
\infty$, $\forall l \in \{0, \ldots, L\}$, $G^l_{B|K^l} \to G^l_B$ (with $G^l_B$ as defined in Section 4.2), and $\forall r \in \{1, \ldots, K^{l+1}\}$,
$G^l_{r|K^l} \sim \mathrm{DP}(\alpha^l, G^l_B)$, tending to a draw from an $l$-level nHDP.

Proof We note that as $K^0 \to \infty$, $G^0_{B|K^0} \to G^0_B$, where the convergence of measures
is defined by $\int g \, dG^0_{B|K^0} \xrightarrow{D} \int g \, dG^0_B$ for all real valued functions $g$ measurable with
respect to $G^0_B$, as shown by [Ishwaran and Zarepour, 2002]. We note that we already have
$G^0_{r|K^0} \sim \mathrm{DP}(\alpha^0, G^0_{B|K^0})$. This follows from the definition of the DP, since $G^0_{r|K^0}$
follows equation [2] with respect to the base measure $G^0_{B|K^0}$ and the concentration parameter
$\alpha^0$. As $K^0 \to \infty$, we have already established that $G^0_{B|K^0} \to G^0_B$. Hence it follows that
$G^0_{r|K^0} \sim \mathrm{DP}(\alpha^0, G^0_B)$ as $K^0 \to \infty$. For each level $l \in \{1, \ldots, L\}$, having proved this result
for all previous levels, assuming $K^{l'} \to \infty$, $\forall l' < l$, we can make a similar argument for level
$l$ as that for level 0 to conclude $G^l_{B|K^l} \to G^l_B$, and $\forall r \in \{1, \ldots, K^{l+1}\}$, $G^l_{r|K^l} \sim \mathrm{DP}(\alpha^l, G^l_B)$.
This concludes the proof. ■
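The finite construction used in the theorem is easy to simulate, which can help build intuition for how the levels stack. The sketch below draws the weights for one level and then reuses the same routine for the next level, whose atoms are the groups of the level below. It is an illustration only: the function names, the chosen sizes, and the parameter values are assumptions, and a small floor on the Dirichlet shape parameters is added purely for numerical safety.

```python
import random


def dirichlet(shapes, rng):
    """Dirichlet draw via normalised gamma variates; tiny shapes are floored."""
    draws = [rng.gammavariate(max(s, 1e-12), 1.0) for s in shapes]
    total = sum(draws)
    return [d / total for d in draws]


def finite_level(num_atoms, num_groups, gamma, alpha, rng):
    """Weights of one finite level: beta ~ Dir(gamma/K, ...), pi_r ~ Dir(alpha * beta)."""
    beta = dirichlet([gamma / num_atoms] * num_atoms, rng)
    pis = [dirichlet([alpha * b for b in beta], rng) for _ in range(num_groups)]
    return beta, pis


rng = random.Random(7)
# Level 0: K^0 = 20 topics shared by K^1 = 4 group-level distributions.
beta0, pi0 = finite_level(num_atoms=20, num_groups=4, gamma=5.0, alpha=3.0, rng=rng)
# Level 1: its atoms are the 4 level-0 groups; 3 outermost groups (documents).
beta1, pi1 = finite_level(num_atoms=4, num_groups=3, gamma=5.0, alpha=3.0, rng=rng)
```

Increasing `num_atoms` at each level moves this finite model towards the limit described in the theorem.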


Alternate construction based on the nCRF: The following alternate construction
based on the nCRF is another way to show the nHDP as an infinite limit of a collection of
finite mixture models, similar to that in [Y. Teh and Blei (2006)], using the table and
dish indices of the nCRF from the restaurant analogy.


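$$\begin{aligned}
&\text{For level } l \in \{0, \ldots, L\}: \quad \beta^l \mid \gamma^l \sim \mathrm{Dir}\left(\tfrac{\gamma^l}{K^l}, \ldots, \tfrac{\gamma^l}{K^l}\right) \\
&\forall r \in \{1, \ldots, K^{l+1}\}: \quad \pi^l_r \mid \alpha^l \sim \mathrm{Dir}\left(\tfrac{\alpha^l}{T^l_r}, \ldots, \tfrac{\alpha^l}{T^l_r}\right) \\
&\forall t \in \{1, \ldots, T^l_r\}: \quad k^l_{rt} \mid \beta^l \sim \beta^l \\
&\text{In level } L, \text{ for each outermost group } j \in \{1, \ldots, M\}, \text{ for each observation } i \in \{1, \ldots, N_j\}: \\
&t^L_{ji} \mid \pi^L_j \sim \pi^L_j \ \text{and} \ z^L_{ji} = k^L_{j t^L_{ji}} \\
&\forall l \in \{L-1, \ldots, 0\}: \quad t^l_{ji} \mid z^{l+1}_{ji} = r, \pi^l_r \sim \pi^l_r \ \text{and} \ z^l_{ji} = k^l_{r t^l_{ji}} \qquad (28)
\end{aligned}$$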


Theorem 2 For each $l \in \{0, \ldots, L\}$, as $K^l \to \infty$ and $T^l_r \to \infty$, $\forall r \in \{1, \ldots, K^{l+1}\}$, the
generative process described above in equation [28] is equivalent to the nHDP.

Proof At each level $l$, as $K^l \to \infty$ and $T^l_r \to \infty$, $\forall r \in \{1, \ldots, K^{l+1}\}$, the predictive
distribution of the draw from each Dirichlet distribution above tends to a CRP, and hence the
draws of the table and dish indices in the above construction in equation [28] are the same as
those from the nCRF described in the previous section. Hence the multi-level finite mixture
model in [28] tends to the nHDP in the infinite limit. ■


In the case of $L = 1$, with a single level of nesting, we once again note the similarities
between the two-level nHDP and the Author Topic Model (ATM) [M. Rosen-Zvi and Smyth
(2004)]. With $L = 1$ and referring to the index of the outermost group with $j$ instead of
$r$, $\{G^1_j\}$ parallel the distribution over authors in each document (uniformly distributed in the
ATM) while $\{G^0_r\}$ parallel the authors' distributions over topics. We note that the finite
version of the two-level nHDP additionally models the base distributions $G^1_B$ for the global
popularity of authors and $G^0_B$ for the global popularity of topics, leading to a generalization
of the ATM.


5. Inference 

We use Gibbs sampling for approximate inference as exact inference is intractable for this
problem. The conditional distributions from the nCRF scheme lend themselves to an inference
algorithm, where we sample, at every level $l \in \{0, \ldots, L\}$, the table assignments $t^l_{ji}$ for
customers and the dish assignments $k^l_{rt}$ for tables, where $1 \leq t \leq T^l_r$ and the restaurant
identifier $r \in \{1, \ldots, K^{l+1}\}$. (Recall that $r$ is a restaurant identifier at level $l$ and the number
of restaurants in level $l$ is the same as the number of dishes in level $l+1$.) Note that for the
outermost level $L$, $K^{L+1} = M$, i.e. the number of restaurants is the number of observed groups
(or documents) in the outermost level.

The conditional posteriors for Gibbs sampling of these variables can be derived from
the nCRF conditionals. However, unlike inference for a single-level HDP, the naive approach
of sampling all the above indices is intractable, with a complexity that becomes exponential
due to the tight coupling between the variables at different levels.
In this section, we first briefly describe such an nCRF inference technique (Scheme 1), which
samples all the variables involved in the nCRF formulation, to illustrate the computational
intractability that arises from the exponential complexity of this algorithm. Following this,
we describe in more detail an alternate scheme (Scheme 2), based on the direct sampling
technique of the HDP, that overcomes this problem and that we use for the experiments in
Section [7] on entity-topic analysis.

In Appendix 9.1 we also discuss Scheme 1 in more detail for a special case, the
two-level nHDP, using which we experimentally demonstrate the difference in complexity
between the two schemes.

5.1 Inference Scheme 1: nCRF Inference 

In the basic nCRF scheme, the latent variables to be sampled as part of the Gibbs sampling
procedure are the assignments of tables to each customer $i$ belonging to the observed group $j$,
and of dishes to tables, at the different levels. Hence, we wish to sample $t^l_{ji}$ at every level,
and $k^l_{rt}$, where $r \in \{1, \ldots, K^{l+1}\}$ is a restaurant index at level $l$, also corresponding to a
dish in level $l+1$, and $t \in \{1, \ldots, T^l_r\}$ is a table index in restaurant $r$. We start by sampling
variables in level 0, the deepest level, proceeding to variables in level $L$. We illustrate in
this section how the complexity of sampling increases, reaching exponential complexity, as
we go from sampling variables in level 0 through level $L$.

The following minor additions to notation are introduced for convenience during inference.
We denote the set of all observed data as $\mathbf{x} = \{x_{ji} : \forall j, i\}$. We denote the set of
all customers going to table $t$ of restaurant $r$ in level $l$ as $\mathbf{x}^l_{rt} = \{x_{ji} : r^l_{ji} = r, t^l_{ji} = t\}$.
Further, a set with a subscript starting with a hyphen (-) indicates the set of all elements
except the index following the hyphen.

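We start with sampling the level 0 dish assignments to tables, conditioned on the values
of table and dish assignments at all other levels. Hence, we sample $k^0_{rt}$ as follows, for
$r \in \{1, \ldots, K^1\}$, $t \in \{1, \ldots, T^0_r\}$, by integrating out $G^0_B$ (using Eqn. [26]):

$$p(k^0_{rt} = k \mid \mathbf{k}^0_{-rt}, \mathbf{k}^{-0}, \mathbf{t}, \mathbf{x}) \propto \begin{cases} \frac{m^0_{\cdot k}}{m^0_{\cdot\cdot} + \gamma^0} \, p(\mathbf{x}^0_{rt} \mid k^0_{rt} = k, \mathbf{k}^0_{-rt}, \mathbf{k}^{-0}, \mathbf{t}) & \text{where } k \in \{1, \ldots, K^0\} \\ \frac{\gamma^0}{m^0_{\cdot\cdot} + \gamma^0} \, p(\mathbf{x}^0_{rt} \mid k^0_{rt} = K^0 + 1, \mathbf{k}^0_{-rt}, \mathbf{k}^{-0}, \mathbf{t}) & \text{new level 0 dish} \end{cases} \qquad (29)$$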

The first term is obtained from the conditional probability of the CRP for choosing level
0 dishes. We note that the likelihood terms $p(\mathbf{x}^0_{rt} \mid k^0_{rt} = k, \mathbf{k}^0_{-rt}, \mathbf{k}^{-0}, \mathbf{t})$ and
$p(\mathbf{x}^0_{rt} \mid k^0_{rt} = K^0 + 1, \mathbf{k}^0_{-rt}, \mathbf{k}^{-0}, \mathbf{t})$ arise from the probability of all observed data, or
customers, that go to table $t$ of restaurant $r$ at level 0 and that are affected by the assignment
$k^0_{rt} = k$. These terms can be simplified by integrating out the appropriate $\phi$ variables
corresponding to the topic multinomials. (A detailed evaluation of these terms is shown in
Appendix ?? for a special case of this inference algorithm for ungrouped data.) We further note
that this update is similar to that in the direct sampling scheme for a single-level HDP in
[Y. Teh and Blei (2006)].

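For the next level, we sample the update for the dish assignments to tables belonging to level
1 restaurants, $k^1_{rt}$, for each $r \in \{1, \ldots, K^2\}$, $t \in \{1, \ldots, T^1_r\}$. Let
$S^1_{rt} = \{t^0_{ji} : t^1_{ji} = t, z^2_{ji} = r\}$ be the set of level 0 table assignments corresponding to
all customers $j, i$ who have been assigned the table $t$ in level 1 restaurant $r$. Then

$$p(k^1_{rt} = k \mid \mathbf{k}^1_{-rt}, \mathbf{k}^{-1}, \mathbf{t}_{-S^1_{rt}}, \mathbf{x}) \propto \begin{cases} \frac{m^1_{\cdot k}}{m^1_{\cdot\cdot} + \gamma^1} \, p(\mathbf{x}^1_{rt} \mid k^1_{rt} = k, \mathbf{k}^1_{-rt}, \mathbf{k}^{-1}, \mathbf{t}_{-S^1_{rt}}) & \text{where } k = 1, \ldots, K^1 \\ \frac{\gamma^1}{m^1_{\cdot\cdot} + \gamma^1} \, p(\mathbf{x}^1_{rt} \mid k^1_{rt} = K^1 + 1, \mathbf{k}^1_{-rt}, \mathbf{k}^{-1}, \mathbf{t}_{-S^1_{rt}}) & \text{new level 1 dish} \end{cases} \qquad (30)$$

We note that the likelihood terms are conditioned on all table assignments except those
in the set $S^1_{rt}$, since changing the level 1 dish assignment $k^1_{rt}$ of the table in restaurant $r$
changes the level 0 restaurant that the customer enters, due to which the table assignments in
the set $S^1_{rt}$ are not known.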

Hence, evaluating the likelihood term requires marginalizing over all possible assignments
of the latent variables in $S^1_{rt}$. We note that each of these variables can take a value
between $1, \ldots, T^0 + 1$. This leads to $(T^0 + 1)^{|S^1_{rt}|}$ operations to simplify the likelihood
term, leading to an exponential complexity for evaluating the update for $k^1_{rt}$ and rendering
this inference technique intractable.
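A quick calculation shows how fast this count grows. The sketch below simply evaluates the number of joint table assignments for a given number of existing tables and a given number of customers at the table being resampled; the function name and the example sizes are illustrative assumptions.

```python
def marginalisation_terms(num_tables, num_customers):
    """Number of joint table assignments to sum over.

    Each of the `num_customers` affected customers may sit at any of the
    `num_tables` existing tables or at one new table.
    """
    return (num_tables + 1) ** num_customers


# A level 0 restaurant with 10 tables and 20 affected customers.
terms = marginalisation_terms(10, 20)
```

Even for this small example the count exceeds $10^{20}$, which is why explicit marginalization is not a practical option.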

We see that similarly, for a general level $l$, sampling $k^l_{rt}$ for $r \in \{1, \ldots, K^{l+1}\}$, $t \in
\{1, \ldots, T^l_r\}$ requires marginalizing over the following set $S^l_{rt}$ of all table assignments in
all previous levels $l' < l$ for the customers sitting at the particular table $t$ in level $l$
restaurant $r$:

$$S^l_{rt} = \{t^{l'}_{ji} : l' < l, \ t^l_{ji} = t, \ z^{l+1}_{ji} = r\}$$

The cardinality of this set grows with $l$, and the number of joint assignments to marginalize
over grows exponentially in that cardinality, due to which this technique is intractable for a
general $l$, the only exception being $l = 0$ for a single-level HDP, where this technique is
tractable as in equation [29].


5.2 Inference Scheme 2: Direct Sampling Scheme 


To work around the exponential complexity encountered in the previous section, we adopt
a technique similar to the direct sampling scheme in [Y. Teh and Blei (2006)], where the
variables $t^l_{ji}$ and $k^l_{rt}$, $\forall l, j, i, r, t$, are not explicitly sampled. Instead, the variables $G^l_B$
are explicitly sampled for all levels $l$, as opposed to being integrated out, by sampling the
stick-breaking weights $\beta^l$. Further, we directly sample $z^l_{ji}$, the dish assignment at
level $l$ for each customer (word) $i$ in each group (document) $j$, avoiding explicit assignments
of tables to customers and dishes to tables. However, in order to sample $\beta^l$, the table
information is maintained in the form of the aggregated counts $m^l_{rk}$ in each layer, the
number of tables at level $l$ in restaurant $r \in \{1, \ldots, K^{l+1}\}$ assigned to dish $k \in \{1, \ldots, K^l\}$.
(Recall that each restaurant at level $l$ corresponds to a unique dish in level $l+1$. Hence,
$r \in \{1, \ldots, K^{l+1}\}$.) Thus the latent variables that need to be sampled in the Gibbs
sampling scheme are $z^l_{ji}$, $\beta^l$, and $m^l_{rk}$, $\forall l, i, j, r, k$.

We introduce the following notation for the rest of this section. Let $\mathbf{x} = (x_{ji} : \text{all } j, i)$,
$\mathbf{x}_{-ji} = (x_{j'i'} : j'i' \neq ji)$, $\mathbf{m} = (m^l_{rk} : \text{all } r, k, l)$, $\mathbf{z} = (z^l_{ji} : \text{all } j, i, l)$,
$\mathbf{z}^{-l}_{-ji} = (z^{l'}_{j'i'} : j'i' \neq ji, l' \neq l)$, and $\beta^l_{new} = \sum_{k=K^l+1}^{\infty} \beta^l_k$. We now provide the
sampling updates for the dish assignments of customers at each level, starting from level 0,
conditioned on all other dish assignments at all levels.

Sampling $z^0_{ji}$: The conditional distribution for the dish assignment at level 0, $z^0_{ji}$,
depends on the predictive distribution of the dish assignment $z^0_{ji}$, given all other dish
assignments to customers at this level and all other levels, and on the emission probability of
the final observed data $x_{ji}$ under the specific dish assignment. This is given by

$$p(z^0_{ji} = p \mid z^1_{ji} = r, \mathbf{z}^0_{-ji}, \mathbf{m}, \beta, \mathbf{x}) \propto p(z^0_{ji} = p \mid \mathbf{z}^0_{-ji}, z^1_{ji} = r) \, p(x_{ji} \mid z^0_{ji} = p, \mathbf{x}_{-ji})$$

To pick dish $p$ at level 0, conditioned on the dish assignment at level 1 being $r$, the first
term $p(z^0_{ji} = p \mid \mathbf{z}^0_{-ji}, z^1_{ji} = r)$ can be split into two parts: one for picking any of the
existing tables in the level 0 restaurant $r$ that are mapped to dish $p$, and one for creating a new
table in restaurant $r$ and assigning dish $p$ to it. When a new dish is chosen, a
new table is always created in restaurant $r$ at level 0. Hence,


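$$p(z^0_{ji} = p \mid z^1_{ji} = r, \mathbf{z}^0_{-ji}) \propto \begin{cases} \frac{n^0_{r \cdot p} + \alpha^0 \beta^0_p}{n^0_{r \cdot \cdot} + \alpha^0} & \text{Existing dish} \\ \frac{\alpha^0 \beta^0_{new}}{n^0_{r \cdot \cdot} + \alpha^0} & \text{New dish} \end{cases} \qquad (31)$$

where $n^0_{r \cdot p}$ is the number of customers in level 0 restaurant $r$ assigned to dish $p$, excluding
$x_{ji}$, and $n^0_{r \cdot \cdot}$ is the total number of customers in that restaurant.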


The likelihood term $p(x_{ji} \mid z^0_{ji} = p, \mathbf{x}_{-ji})$ is the conditional density of $x_{ji}$ under the level
0 dish (topic) $z^0_{ji} = p$ given all data items except $x_{ji}$. Assuming the level 0 dish is a
topic sampled from a $V$-dimensional symmetric Dirichlet prior over the vocabulary with
parameter $\eta$, i.e. $\phi_p \sim \mathrm{Dir}(\eta)$, the conditional can be simplified to the following expression
by integrating out $\phi_p$:

$$p(x_{ji} = w \mid z^0_{ji} = p, \mathbf{x}_{-ji}) \propto \frac{n_{pw} + \eta}{n_{p\cdot} + V\eta}$$

where $n_{pw}$ is the number of occurrences of level 0 dish (topic) $p$ with word $w$ in the
vocabulary, and $n_{p\cdot} = \sum_w n_{pw}$. We note that this step is similar to that in [Y. Teh and Blei (2006)].
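The level 0 update multiplies a prior term by a word likelihood, and the two pieces can be combined into unnormalized sampling weights. The sketch below assumes the standard direct-assignment form of the HDP sampler: weight $n + \alpha^0 \beta^0_p$ for an existing dish and $\alpha^0 \beta^0_{new}$ for a new dish, each multiplied by the Dirichlet-smoothed word likelihood (which reduces to $1/V$ for a new topic). The shared normalizing denominator is dropped. The function name, argument layout, and toy counts are hypothetical, and the counts are assumed to already exclude the word being resampled.

```python
def level0_dish_weights(word, dish_counts, beta0, beta0_new, alpha0,
                        topic_word_counts, topic_totals, vocab_size, eta):
    """Unnormalised weights over existing level 0 dishes plus one new dish.

    dish_counts[p]       -- customers in the current level 0 restaurant with dish p
    beta0[p], beta0_new  -- global stick-breaking weights of dishes and of the remainder
    topic_word_counts[p] -- dict mapping word -> count under topic p
    topic_totals[p]      -- total number of words under topic p
    """
    weights = []
    for p, n_rp in enumerate(dish_counts):
        prior = n_rp + alpha0 * beta0[p]
        likelihood = ((topic_word_counts[p].get(word, 0) + eta)
                      / (topic_totals[p] + vocab_size * eta))
        weights.append(prior * likelihood)
    # A new topic has no counts, so its smoothed likelihood is eta / (V * eta) = 1 / V.
    weights.append(alpha0 * beta0_new / vocab_size)
    return weights


weights = level0_dish_weights(
    word=0,
    dish_counts=[2, 0],
    beta0=[0.5, 0.3],
    beta0_new=0.2,
    alpha0=1.0,
    topic_word_counts=[{0: 3}, {1: 1}],
    topic_totals=[3, 1],
    vocab_size=2,
    eta=0.5,
)
```

Normalizing the returned list and drawing an index from it yields a new value of the level 0 dish assignment; the last index stands for a new dish.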

For any general level, sampling z^y. The conditional distribution for the dish 
assignment at level I is compnted as 


P{z\i 

oc p{z\i = 


= = r,z]^ ^ 

= r)p{z\r^^ 


= 9,z*-ji,m,/3,x) 

= 9|z‘rji,4 = p) 


The first term is the predictive distribution of given the next level dish assignment 
r (to specify which level I restanrant the customer goes to), while the second term arises 
from the previous level dish assignment q that depends on the valne of Zj^. Again, p{z^j^ = 
p|z]_jj, = r) can be viewed as consisting of two terms. One from picking an existing 
table in restanrant r with dish assignment p and one from creating a new table in restaurant 
r at level I and assigning the dish p to it. Further, creation of a new dish always involves 
the creation of a new table. Hence, 


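\[
p(z^l_{ji} = p \mid z^l_{-ji}, z^{l+1}_{ji} = r) \propto
\begin{cases}
\dfrac{n^l_{rp} + \alpha^l \beta^l_p}{n^l_{r\cdot} + \alpha^l} & \text{Existing dish} \\[1ex]
\dfrac{\alpha^l \beta^l_{new}}{n^l_{r\cdot} + \alpha^l} & \text{New dish}
\end{cases}
\]

Similarly,

\[
p(z^{l-1}_{ji} = q \mid z^{l-1}_{-ji}, z^l_{ji} = p) \propto
\begin{cases}
\dfrac{n^{l-1}_{pq} + \alpha^{l-1} \beta^{l-1}_q}{n^{l-1}_{p\cdot} + \alpha^{l-1}} & \text{Existing dish} \\[1ex]
\beta^{l-1}_q & \text{New dish}
\end{cases}
\]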
Sampling $\beta$: At each level $l$, the posterior of $G^l_0$, conditioned on samples observed
from it, is also distributed as a DP due to Dirichlet-Multinomial conjugacy, and the stick
breaking weights of $G^l_0$ can be sampled as follows:

\[
(\beta^l_1, \beta^l_2, \ldots, \beta^l_{new}) \sim Dir(m^l_{\cdot 1}, m^l_{\cdot 2}, \ldots, \gamma^l)
\]
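As a small sketch of this step, assuming the usual HDP form in which the weights follow a Dirichlet distribution whose parameters are the per-dish table counts followed by the top-level concentration parameter, a Dirichlet draw can be obtained by normalising independent gamma variates. The function name and arguments below are hypothetical.

```python
import random

def sample_beta(m_dot, gamma, rng=random):
    """Draw (beta_1, ..., beta_K, beta_new) ~ Dirichlet(m_dot[0..K-1], gamma).

    m_dot : m_dot[k] = total number of tables serving dish k at this level
            (must be > 0; dishes with no tables should be removed first)
    gamma : concentration parameter of the global DP at this level
    """
    # A Dirichlet sample is a vector of independent Gamma(a_k, 1) draws,
    # divided by their sum.
    draws = [rng.gammavariate(a, 1.0) for a in list(m_dot) + [gamma]]
    total = sum(draws)
    return [g / total for g in draws]
```

The last component of the result plays the role of the weight reserved for a new dish.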


Sampling $m$: $m^l_{rk}$ is the number of tables in level $l$ restaurant $r \in \{1, \ldots, K^{l+1}\}$ that
are assigned to the level $l$ dish $k \in \{1, \ldots, K^l\}$. In other words, $m^l_{rk}$ is the number of tables
created as $n^l_{rk}$ samples are drawn from $G^l_r$ in restaurant $r$ that correspond to a particular
dish $k$. This is the number of partitions generated as samples are drawn from a Dirichlet
Process with concentration parameter $\alpha^l \beta^l_k$, and it is distributed according to a closed form
expression (Antoniak, 1974). However, we adopt an alternate method (E. Fox and Willsky,
2011) for sampling $m^l_{rk}$: we draw a total of $n^l_{rk}$ samples with dish $k$, incrementing
the count whenever a new table is created in restaurant $r$ with dish assignment $k$.
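The simulation described above can be sketched as follows. This is an illustrative sketch rather than the code used in the experiments; the function name and arguments are hypothetical. It relies on the Chinese restaurant process fact that, after $i$ customers with dish $k$ are seated, the next one opens a new table with probability $\alpha\beta_k / (i + \alpha\beta_k)$.

```python
import random

def sample_table_count(n_rk, alpha, beta_k, rng=random):
    """Simulate the number of tables serving dish k in restaurant r.

    n_rk   : number of customers in restaurant r assigned to dish k
    alpha  : concentration parameter of the restaurant-level DP
    beta_k : global weight of dish k
    """
    m = 0
    for i in range(n_rk):
        # With i customers already seated, a new table is opened with
        # probability alpha*beta_k / (i + alpha*beta_k); for i = 0 this is 1.
        if rng.random() < alpha * beta_k / (i + alpha * beta_k):
            m += 1
    return m
```

The cost is linear in the number of customers, which is usually acceptable because the loop involves no likelihood evaluations.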

Sampling Concentration parameters: We place a vague gamma prior on the
concentration parameters $\alpha^l$ and $\gamma^l$ at each level $l$, with hyperparameters $(\alpha_a, \alpha_b)$ and
$(\gamma_a, \gamma_b)$ respectively. We use a Gibbs sampling scheme for sampling the concentration
parameters, using the technique outlined in Y. Teh and Blei (2006).
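To give a flavour of such updates, the sketch below shows the well-known auxiliary-variable update of Escobar and West (1995) for the concentration parameter of a single DP under a gamma prior. It is only an illustration under stated assumptions: the function name and arguments are hypothetical, the gamma prior is parameterised by shape `a` and rate `b`, and the update for restaurant-level concentration parameters in an HDP requires additional auxiliary variables that are not shown here.

```python
import math
import random

def resample_concentration(gamma, K, n, a, b, rng=random):
    """One auxiliary-variable update of a DP concentration parameter.

    gamma : current value of the concentration parameter
    K     : number of distinct atoms (dishes) observed, K >= 1
    n     : number of draws from the DP, n >= 1
    a, b  : shape and rate of the gamma prior (assumed parameterisation)
    """
    eta = rng.betavariate(gamma + 1.0, n)      # auxiliary variable in (0, 1)
    rate = b - math.log(eta)
    odds = (a + K - 1.0) / (n * rate)          # mixture odds for the two gammas
    if rng.random() < odds / (1.0 + odds):
        shape = a + K
    else:
        shape = a + K - 1.0
    # random.gammavariate takes (shape, scale), so the rate is inverted.
    return rng.gammavariate(shape, 1.0 / rate)
```

Repeating this update a few times per Gibbs sweep is a common choice, although the number of repetitions is a tuning decision.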


6. Experimental evaluation of inference complexity 

The nCRF scheme (Scheme 1) is computationally more expensive than the direct sampling
scheme. Scheme 1, as described in section 5.1, has exponential complexity even for the
2-level nHDP. Hence, we introduced the direct sampling scheme in section 5.2 to obtain a
tractable inference algorithm. In this section, we illustrate this difference through some experiments.

First, we perform experiments with the single level nHDP to compare the inference
(training) time of the two schemes. The results of this experiment are shown in figure
2a. We also compare the perplexity obtained on held-out test data with both schemes
for the single level nHDP. The perplexity results on the NIPS dataset with 20 percent of the
documents held out are shown in table 2. We note that while the nCRF scheme (Scheme 1)
performs better in terms of perplexity, the direct sampling scheme is faster. This difference
in complexity increases exponentially as we add more levels to the nHDP.

To better illustrate the difference in computational complexity between the two schemes,
we compare the runtime of the two algorithms for a special case of our
2-level nHDP model where there is a single restaurant in the outermost level. In this special
setting, the HDP at the outermost level can be replaced by a simple DP, since no
sharing of atoms between restaurants is required. We discuss both the naive nCRF and the
direct sampling inference algorithms for this setting in detail in appendix 9.1. We perform
experiments on a 100 document subset of the NIPS dataset to compare runtime in this
special setting. The results are shown in figure 2b. We note that the direct sampling
technique is an order of magnitude faster in terms of runtime, as illustrated in
figure 2.
[Figure 2: two runtime plots, panels (a) and (b); axis label: time in seconds]

(a) Single level nHDP (HDP): We see that direct sampling is faster than the nCRF scheme.
However, the difference is not as pronounced as in the case with more levels.

(b) Two-level nHDP for ungrouped data: We see that direct sampling runs for several
iterations before the nCRF technique completes the first few Gibbs sampling iterations.

Figure 2: Comparison of Runtime: Direct Sampling vs nCRF


Model        Direct Sampling    nCRF
Perplexity   2230               1937

Table 2: Perplexity comparison for single level nHDP for NIPS dataset with different
inference schemes


7. Non-Parametric entity-topic Modeling : Experimental Analysis 

In this section, we experimentally evaluate the proposed nHDP model in the context of
non-parametric entity-topic modeling, with a two-level nHDP, for the task of modeling author
entities who have collaboratively written research papers, and compare its performance
against available baselines. Specifically, we evaluate two different aspects: (1) how well the
model is able to learn from the training samples and fit held-out data in terms of perplexity,
(a) first, when all the authors are observed in training and test documents, (b) second,
when some of the authors are unobserved in training and test documents, and (c) finally, when
all authors are unobserved, to understand the effect of a multi-level HDP in comparison with a
single level HDP on perplexity; and (2) how accurately the model discovers hidden authors,
who are not mentioned at all in the corpus.

We consider the following models for the experiments: (i) The author-topic model (ATM)
(Eqn. 11), where the number of topics is pre-specified and all authors are observed for all
documents. This is used as a baseline for (1a) above. (ii) The Hierarchical Dirichlet Process
(HDP) (Eqn. 5), using the direct assignment inference scheme for fair comparison. We use
our own implementation for this. Recall that the HDP infers the number of topics and
does not use author information. (iii) nHDP with completely observed entities (nHDP-co),
which assumes complete entity information to be available for all documents, but learns
topics in a nonparametric fashion. This can be seen as an improvement over ATM
in which the number of topics does not need to be specified. (iv) nHDP with partially
observed entities (nHDP-po), which makes use of available entity information, but admits
the possibility of entities being hidden globally from the corpus, or locally from individual
documents. (v) nHDP with no observed entities (nHDP-no), which does not make use of
any entity information and assumes all entities to be globally hidden in the corpus. For
task (1a) above, the applicable models are the ATM, HDP (which ignores the entity
information) and nHDP-co. For tasks (1b) and (1c), the ATM is not applicable; we evaluate
HDP and nHDP-po / nHDP-no. It is important to point out that there are no available
baselines in terms of entity-topic analysis for task (2) above when some or all of the authors
are unobserved.

Model        ATM     HDP     nHDP-co
Perplexity   2783    1775    1247

Table 3: Perplexity of ATM, HDP and nHDP-co for NIPS

We use the following publicly available publication datasets for our experimental
analysis. The NIPS dataset¹ is a collection of papers from the Neural Information Processing
Systems (NIPS) conference proceedings (volumes 0-12). This collection contains 1,740
documents contributed by a total of 2,037 authors, with a total of 2,301,375 word tokens,
resulting in a vocabulary of 13,649 words. The second is a subset of the DBLP Abstracts dataset²
containing 12,000 documents by 15,252 authors, collected from the records of 20 conferences
on the Digital Bibliography and Library Project (DBLP). Each document is represented as a
bag of words present in the abstract and title of the corresponding paper, resulting in a
vocabulary size of 11,771 words.


1. Generalization Ability: We now come to our first experiment, where we evaluate
the ability of the models, whose parameters are learnt from a training set, to predict words
in new unseen documents in a held-out test set. We evaluate the performance of a model $M$
on a test collection $D$ using the standard notion of perplexity (D. Blei and Jordan):

\[
\exp\left(-\frac{\sum_{d \in D} \log p(w_d \mid M)}{\sum_{d \in D} N_d}\right)
\]

where $w_d$ denotes the words of document $d$ and $N_d$ is their number.
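For concreteness, held-out perplexity can be computed from per-document log-likelihoods as in the sketch below. It assumes the usual per-word normalisation, i.e. the exponential of the negative total log-likelihood divided by the total number of test tokens; the function name and its inputs are hypothetical, and how the per-document log-likelihoods are estimated under each model is outside its scope.

```python
import math

def perplexity(doc_log_likelihoods, doc_lengths):
    """Per-word perplexity of a test collection.

    doc_log_likelihoods : log p(w_d | M) for each test document d
    doc_lengths         : number of word tokens in each test document
    """
    total_ll = sum(doc_log_likelihoods)
    total_words = sum(doc_lengths)
    return math.exp(-total_ll / total_words)
```

As a sanity check, a model that assigns every word a uniform probability over a vocabulary of 100 words has perplexity 100, and lower values indicate a better fit.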

In experiment (1a), all authors are observed in training and test documents. To favor
the ATM, which cannot handle new authors in test documents, we create test-train splits
ensuring that each author in the test collection occurs in at least one training document.

Perplexity results are shown in Table 3. Recall that HDP and nHDP find the best
number of topics, while for ATM we have recorded its best performance across different
values of K. The results show that while knowledge of authors is useful, the ability of
non-parametric topic models to infer the number of topics clearly leads to better generalization
ability.

Next, in experiment (1b), we first create training-test distributions with reasonable
author overlap by letting each author vote, with probability 0.7, whether to send a document
to train or test; the majority decision is taken for each document. Next, authors are
partially hidden from both the test and the train documents as follows. We iterate over

1. http://www.arbylon.net/resources.html

2. http://www.cs.uiuc.edu/~hbdeng/data/kdd2011.htm
Model             HDP     nHDP-no   nHDP-po    nHDP-po    nHDP-po    nHDP-co
p_g, p_l          1, 1    1, 1      0.6, 0.6   0.4, 0.4   0.2, 0.2   0, 0
Perplexity NIPS   2572    1882      1434       1266       1109       987
Perplexity DBLP   1027    997       935        869        676        394

Table 4: Perplexity for HDP and nHDP with varying percentage of hidden authors


the global list of authors and remove each author from all training and test documents
with probability $p_g$. We then iterate over each training and test document, and remove
each remaining author of that document with probability $p_l$. We experiment with different
values of $p_g$ and $p_l$ to simulate different extents of missing information on authors. $p_g = 1$
and $p_l = 1$ corresponds to (1c), the case where authors are completely unobserved. This
setting enables us to compare the two-level nHDP, with completely unobserved dishes at
each level, with a HDP, to understand the relative merit of multi-level modeling over a
single level in terms of perplexity.
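The hiding procedure just described can be sketched as follows. This is a minimal illustration, not the script used to prepare the experiments; the function name, the representation of each document as a list of author identifiers, and the names `p_g` and `p_l` for the global and local hiding probabilities are assumptions made for the example.

```python
import random

def hide_authors(doc_authors, p_g, p_l, rng=random):
    """Return a copy of doc_authors with some authors hidden.

    doc_authors : list of author lists, one per document
    p_g         : probability of removing an author from the whole corpus
    p_l         : probability of removing a remaining author from one document
    """
    all_authors = sorted({a for authors in doc_authors for a in authors})
    # Global pass: each author is removed from every document with probability p_g.
    globally_hidden = {a for a in all_authors if rng.random() < p_g}
    observed = []
    for authors in doc_authors:
        # Local pass: each surviving author is dropped from this document
        # with probability p_l.
        kept = [a for a in authors
                if a not in globally_hidden and rng.random() >= p_l]
        observed.append(kept)
    return observed
```

Setting both probabilities to 1 leaves every document without observed authors, and setting both to 0 leaves the author lists unchanged.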

The results are shown in Table 4. We can see that as more information about the
authors becomes available, the ability to fit held-out data improves. More interestingly, even
when no or very little author information is available, just the assumption that a discrete
set of authors exists, i.e. introducing an additional layer of HDP, leads to better generalization
ability, corroborating the need for multi-level modeling, as can be seen from the relative
performance of HDP and nHDP-no.

2. Discovering Missing Authors: Beyond data fitting, the most significant ability
of our model is to discover entities which are relevant for documents in the corpus, but
are never mentioned. We perform a case study with the 6 most prolific authors in
NIPS, by removing them completely from the corpus, and then checking the ability of the
model to discover them in a completely unsupervised fashion. While the task of detecting
locally missing authors in individual documents can be defined as a classification problem
when the author is observed in other documents, we reiterate that there is no
existing baseline when an author is globally hidden.

We evaluate the accuracy of discovering hidden authors as follows. For each hidden author
$h \in \{1 \ldots H\}$, we create an $m$-dimensional vector $c_h$, where $m$ is the corpus size, with $c_h[j]$
indicating that author's authorship in document $j$. We explored two possibilities for this 'true'
indicator vector: (a) binary indicators using the gold-standard author names for documents,
and (b) the number of words written by that author in the document according to nHDP
with completely observed authors (nHDP-co). Similarly, we create an $m$-dimensional vector
$c_n$ for each new author $n \in \{1 \ldots N\}$ discovered by the nHDP-po, with $c_n[j]$ indicating that
author's contribution (number of authored words) in document $j$. We now check how well the
vectors $\{c_n\}$ correspond to the 'true' vectors $\{c_h\}$. This is done by defining two variables
$C_h$ and $C_n$, taking values $1 \ldots H$ and $1 \ldots N$ respectively, and defining a joint distribution
over them as $P(h, n) = \frac{1}{Z} sim(c_h, c_n)$, where $Z$ is a normalization constant. For $sim(c_h, c_n)$,
we use cosine similarity between normalized versions of $c_h$ and $c_n$. Mutual information

\[
I(C_h, C_n) = \sum_h \sum_n p(h, n) \log \frac{p(h, n)}{p(h)\, p(n)}
\]

measures the information that $C_h$ and $C_n$ share. We used its normalized variant

\[
NMI(C_h, C_n) = \frac{I(C_h, C_n)}{[H(C_h) + H(C_n)]/2}
\]

($H(X)$ indicating entropy of $X$), which takes values between 0 and 1, higher values indicating
more shared information.
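The evaluation measure can be sketched in Python as below. This is an illustrative sketch with hypothetical function names. It assumes non-negative vectors (indicators or word counts), at least one pair with non-zero similarity, and omits the explicit normalisation of the vectors, which does not change the cosine similarity when it is a positive rescaling.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0 if either is all zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

def nmi_from_vectors(true_vecs, found_vecs):
    """NMI between true and discovered authors via a similarity-based joint."""
    sim = [[cosine(ch, cn) for cn in found_vecs] for ch in true_vecs]
    Z = sum(sum(row) for row in sim)                 # normalisation constant
    joint = [[s / Z for s in row] for row in sim]    # P(h, n)
    p_h = [sum(row) for row in joint]                # marginal over true authors
    p_n = [sum(col) for col in zip(*joint)]          # marginal over discovered authors
    mi = sum(
        joint[h][n] * math.log(joint[h][n] / (p_h[h] * p_n[n]))
        for h in range(len(p_h))
        for n in range(len(p_n))
        if joint[h][n] > 0
    )
    denom = (entropy(p_h) + entropy(p_n)) / 2.0
    return mi / denom if denom > 0 else 0.0
```

With this construction, discovered authors that match the true authors one-to-one and do not overlap across documents give a value of 1, while discovered vectors that are equally similar to every true author give 0.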
First, we note that the best NMI achievable for this task, obtained by substituting the true
vectors $\{c_h\}$ for the discovered vectors $\{c_n\}$, is 0.86 for case (a) and 0.98 for case (b) above. In
comparison, using nHDP-po, we achieve NMI scores of 0.59 for case (a) and 0.72 for case (b).
This indicates that the author distributions that the model discovers not only help
in fitting the data, but also have reasonable correspondence with the true hidden authors.
We believe that this is a promising initial step in addressing this difficult problem.

8. Conclusions 

In this paper, we have proposed the nested Hierarchical Dirichlet Process as a prior for
multi-level admixture modeling. We have also addressed the problem of entity-topic analysis
from document corpora, where the set of document entities is either completely or partially
hidden, through the two level nHDP, which consists of two levels of Hierarchical Dirichlet
Processes, where one is the base distribution of the other. We explored inference algorithms
for nHDP and, using a direct sampling scheme for inference, have shown that the nHDP
is able to generalize better than existing models under varying available knowledge about
authors in research publications, and is additionally able to discover completely hidden
authors in the corpus.
9. Appendix

9.1 Two-level Inference with Ungrouped data at Outermost Level 

In this section, we describe the collapsed Gibbs sampling inference for the setting with
ungrouped data at the outermost level in the entity-topic application for document modeling.
This is a special case of the two level nHDP model with a DP in the outer level instead
of a HDP. While we use notation similar to the nHDP inference described in section 5,
the observed data is indexed by a single index $i$ (the index $j$ vanishes since there is no
demarcation into groups, i.e. documents, at the outermost level). The nCRF representation
for this setting involves assigning a dish (entity) $z^1_i$ with index $k^1_i$ to every customer $i$ based
on $G^1$, the global distribution over entities, based on which the customer enters an inner
level restaurant $r$. At this restaurant the customer picks a table $t = t^0_i$, which is assigned
a corresponding dish $z^0_i$ with index $k^0_i$ that corresponds to a topic. The observed data is
generated based on the topic assignment thus obtained.

We now describe the two inference schemes of section 5 for the two level nHDP
for entity topic modeling in this special case of ungrouped data. Note that an
experimental comparison of both these schemes is shown in section 6.
Scheme 1: Naive nCRF based Sampling for entity-topic modeling of ungrouped data:
The latent variables to be sampled include $t^0_i$, $k^1_i$ for each observation $i = 1, \ldots, M$ and
$k^0_{rt}$ for restaurant $r = 1, \ldots, K^1$ and table $t = 1, \ldots, T^0_r$. Sampling $k^0_{rt}$ is similar to that in
the full nCRF inference procedure described in the previous section and is not described
here.

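The update for selecting the level 0 table $t^0_i$ for each customer can be obtained as follows
by integrating out the appropriate distribution:

\[
p(t^0_i = t, z^1_i = r \mid t^0_{-i}, X, k) \propto
\begin{cases}
\dfrac{n^0_{rt}}{n^0_{r\cdot} + \alpha^0}\, p(x_i \mid t^0_i = t) & \text{Existing } t = 1, \ldots, T^0_r \\[1ex]
\dfrac{\alpha^0}{n^0_{r\cdot} + \alpha^0}\, p(x_i \mid t^0_i = T^0_r + 1) & \text{New table } t = T^0_r + 1
\end{cases}
\tag{32}
\]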


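where $p(x_i \mid t^0_i = T^0_r + 1)$ can be evaluated as follows, considering the different level 0 dishes
that can be assigned to the new level 0 table:

\[
p(x_i \mid t^0_i = T^0_r + 1) = \sum_{q=1}^{K^0} \frac{m^0_{\cdot q}}{m^0_{\cdot\cdot} + \gamma^0}\, p(x_i \mid z^0_i = q) + \frac{\gamma^0}{m^0_{\cdot\cdot} + \gamma^0}\, p(x_i \mid z^0_i = K^0 + 1)
\]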


The overall cost of this update step is $O(M K^0)$.
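The mixture that gives the likelihood of an observation at a newly created table can be sketched as below. This is an illustrative helper with hypothetical names; it assumes that the per-dish likelihoods of the observation have already been computed, and it weights each existing dish by its table count and a brand new dish by the top-level concentration parameter.

```python
def new_table_predictive(lik_existing, lik_new, m_dot, gamma):
    """Likelihood of an observation seated at a newly created table.

    lik_existing : p(x | dish q) for each existing dish q
    lik_new      : p(x | a brand new dish), i.e. the prior predictive
    m_dot        : m_dot[q] = number of tables currently serving dish q
    gamma        : top-level concentration parameter
    """
    total = sum(m_dot) + gamma
    mix = sum(m * f for m, f in zip(m_dot, lik_existing))
    return (mix + gamma * lik_new) / total
```

Each call loops once over the existing dishes, so the cost of evaluating it for every observation grows with the product of the number of observations and the number of dishes.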

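The update for $k^1_i$ can be obtained by integrating out $G^1$ as follows.

\[
p(k^1_i = r \mid k^1_{-i}, X, t_{-i}) \propto
\begin{cases}
\dfrac{n^1_r}{n^1_\cdot + \gamma^1}\, p(x_i \mid k^1_i = r) & \text{Existing } r = 1, \ldots, K^1 \\[1ex]
\dfrac{\gamma^1}{n^1_\cdot + \gamma^1}\, p(x_i \mid k^1_i = K^1 + 1) & \text{New table } r = K^1 + 1
\end{cases}
\tag{33}
\]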


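Changing the value of $k^1_i$ invalidates the existing assignment to $t^0_i$. Hence, evaluating
$p(x_i \mid k^1_i = r)$ requires summing over possible values of $t^0_i$ as follows:

\[
p(x_i \mid k^1_i = r) = \sum_{t=1}^{T^0_r} \frac{n^0_{rt}}{n^0_{r\cdot} + \alpha^0}\, p(x_i \mid t^0_i = t, k^1_i = r) + \frac{\alpha^0}{n^0_{r\cdot} + \alpha^0}\, p(x_i \mid t^0_i = T^0_r + 1, k^1_i = r)
\]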


In turn, $p(x_i \mid t^0_i = T^0_r + 1, k^1_i = r)$ corresponds to the case where a new level 0 table is created,
and requires summing over all potential level 0 dish assignments for this table.
Hence, $p(x_i \mid t^0_i = T^0_r + 1, k^1_i = r)$, and similarly $p(x_i \mid k^1_i = K^1 + 1)$, which involves the
creation of a new level 1 table, can be evaluated as follows:

\[
p(x_i \mid t^0_i = T^0_r + 1, k^1_i = r) = \sum_{q=1}^{K^0} \frac{m^0_{\cdot q}}{m^0_{\cdot\cdot} + \gamma^0}\, p(x_i \mid z^0_i = q) + \frac{\gamma^0}{m^0_{\cdot\cdot} + \gamma^0}\, p(x_i \mid z^0_i = K^0 + 1)
\]


Scheme 2: Direct Sampling for entity-topic modeling of Ungrouped data
The latent variables involved in the direct sampling scheme are $z^0_i$ and $z^1_i$ for each
observation $i = 1, \ldots, M$. Sampling $z^0_i$ is similar to that for the case of full nHDP direct
sampling:

\[
p(z^0_i = p \mid z^1_i = r, z^0_{-i}, m, \beta, x) \propto p(z^0_i = p \mid z^0_{-i}, z^1_i = r)\, p(x_i \mid z^0_i = p, x_{-i})
\]
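The first term can be expanded to the following, similar to that in the full nHDP, with
$n^0_{rp}$ defined as in section 5.2:

\[
p(z^0_i = p \mid z^1_i = r, z^0_{-i}) \propto
\begin{cases}
\dfrac{n^0_{rp} + \alpha^0 \beta^0_p}{n^0_{r\cdot} + \alpha^0} & \text{Existing dish} \\[1ex]
\dfrac{\alpha^0 \beta^0_{new}}{n^0_{r\cdot} + \alpha^0} & \text{New dish}
\end{cases}
\tag{34}
\]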


The second term can be simplified, similar to section 5.2, by integrating out the $\phi$
multinomials corresponding to the dishes in the innermost level.

The update for $z^1_i$ can be similarly obtained as

\[
p(z^1_i = p \mid z^1_{-i}, z^0_i = q, z^0_{-i}, m, \beta, x)
\propto p(z^1_i = p \mid z^1_{-i})\, p(z^0_i = q \mid z^0_{-i}, z^1_i = p)
\]

The first term is the conditional of a simple CRP, while the second term simplifies as

\[
p(z^0_i = q \mid z^0_{-i}, z^1_i = p) \propto
\begin{cases}
\dfrac{n^0_{pq} + \alpha^0 \beta^0_q}{n^0_{p\cdot} + \alpha^0} & \text{Existing dish} \\[1ex]
\beta^0_q & \text{New dish}
\end{cases}
\]







